Building decentralized AI on atproto

Hi, everyone. My name is Maxine. I am the co-founder and CTO of Forecast Bio. Thank you for the opportunity to speak here. It's been really exciting for me to be a part of this community for a little more than a year now. And I'm really excited about the possibilities that atproto has for giving us agency over our social experience, over the way that we are able to shape our own science, our own investigation, and our own collective sense-making, as I'm sure was talked about a lot at the at science meeting before the main meeting.

So I'm going to talk about what we're building at Forecast to facilitate this vision of a future of AI built on atproto, using the tools that are built into the infrastructure of the protocol to facilitate collective sharing and understanding of large data sets, and also of model weights. First, an introduction to Forecast, because it's probably a little bit out of left field to have a dot-bio domain show up in this setting, and not in the sense of biography, but actually biology.

So Forecast is a drug discovery and development company, actually. Our vision is focused on providing people with cognitive agency, and on recognizing that historically we've done a very bad job of developing central nervous system therapeutics in the pharmaceutical space. What we want to be able to say is: look, there is this vast space of where your mind could be at any given moment, and a vast number of ways in which your mind can evolve in its instantaneous state across this space of all the different things that a mind can be.

So you're somewhere in this vast sea of all the mental states that are possible. And what we want to do is produce therapeutic interventions. We're starting with small-molecule drugs, but we'd really like to see this expand into all types of interventions that allow you to say: no matter which spot I'm at in this space, I want to be able to move to any other spot in this space. That gives you control of the way that your mind operates over time. We think that this overall vision provides a really different way of thinking about how drugs are developed in neurology and psychiatry, where historically we've had a really difficult time bringing new tools to market.

And so we are using artificial intelligence models, on the language side, on the image side, on the multimodal side, all of these techniques that have been radically transformative in every field of science and computation over the last few years, to build out this map of every cognitive state that a person could be in over time. Then, to understand the biology, we grow mini brains in the lab and watch videos of their brain activity over time to map out how pharmaceutical interventions impact these things.

So the difficulty that we run into is that right now, in the ecosystem for AI, these models are largely produced by a small subset of orgs, as I'm sure a lot of people in the audience are familiar with. And this creates a number of practical and, I think, for science as a whole, existential problems. What are the long-term impacts of the incentive structures that are present for individual large orgs producing these models? And even just having a small, sparse subset of models makes people anchor their understanding.

So even if you have a large number of individual startups building things based off of these, say in the biotech space, if you have image models that show you biological imaging but they're all based off of the same underlying base, then the knowledge that you're seeing when you're doing scientific investigation is really, really concentrated around whatever semantics were put into the specific training regime of that base, right? So there's this outsized power that comes with being the one that controls that.

And there's also an outsized effect of quashing the variability in the entire ecosystem of the way investigation works. This is something that a lot of people, Anthropic even, have published on: the overall effects of language models on the way that we interact with text and the language that we produce. So this concentration of the training and the infrastructure into only a handful of players is really deleterious for the field long term. But atproto is this incredible ecosystem that provides infrastructure for solving exactly this problem. You have data that is distributed, that people have real, genuine ownership of, that is not locked into a particular monolithic space, and that is schematized in order to be mutually understandable and exchangeable, which for scientific data is incredibly important.

An image is not the same as any other image. The actual experimental details really matter, and so being able to bookkeep on that is incredibly important. Identity is incredibly important in this context: being able to understand where data is coming from in a way that doesn't lead to the same concentrating effects, but allows people to contribute in a way where trust is built out, like is currently done in the Bluesky ecosystem with verifications and the verification-graph kind of architecture that's there. And then building it out in a way that's composable, with open protocols that allow applications to interchange.

This has already started to flourish amazingly on the social side. For scientific data, what people were talking about on the knowledge side at the at science meeting is really starting to blossom, and I think there is an incredible opportunity, even outside of the social and direct text-knowledge side, in how we share interoperable, full-on data sets. So what we're building is based off of this model that positive-sum is really important for giving small orgs, like our startup and other biotech startups, a meaningful way to produce AI models that compete with the large labs.

That's done in a way that's interoperable, so that the sum total of all of our small startups in the ecosystem is something greater than the sum of its parts. This is what we see as a path forward, built off of the infrastructure of atproto, for a small ten-person team, say, to be able to outcompete massive organizations. And that's because the network, the infrastructure, is the force multiplier for our efforts in training and building robust AI systems. So what we've built at Forecast over the last bit is our own pass at treating this for data sets.

We found this to be really important for our own infrastructure, for how we're building the image models, video models, and multimodal models that we're using internally for our own discovery and development programs. And so we decided to put some work into making that generalizable, so that we can share it with the community and have this larger buy-in around sharing data. What atdata is really based on is a fundamental technology called WebDataset, which is a way that individual large data sets can be sharded up. It's very simple: it's just tar files, essentially, put into shards. But because of the wrapper that you can put on top of that, you can schematize things very easily by having a standardized way that those files are stored, you can stream things really easily, and you can have sidecar manifest files that let you do querying and other operations really quickly.
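To make the shard idea concrete, here is a minimal sketch using only the Python standard library, not the WebDataset package itself. It assumes the common WebDataset convention that files sharing a basename form one sample and the extension names the field; the helper names `write_shard` and `read_shard` are hypothetical.

```python
import io
import tarfile


def write_shard(samples, fileobj):
    """Write (key, {ext: bytes}) samples as one tar shard, one file per field."""
    with tarfile.open(fileobj=fileobj, mode="w") as tar:
        for key, fields in samples:
            for ext, payload in fields.items():
                info = tarfile.TarInfo(name=f"{key}.{ext}")
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))


def read_shard(fileobj):
    """Stream samples back, grouping consecutive members that share a key."""
    current_key, current = None, {}
    # "r|" reads the tar sequentially, so it also works on non-seekable streams.
    with tarfile.open(fileobj=fileobj, mode="r|") as tar:
        for member in tar:
            key, _, ext = member.name.partition(".")
            if key != current_key and current:
                yield current_key, current
                current = {}
            current_key = key
            current[ext] = tar.extractfile(member).read()
    if current:
        yield current_key, current
```

Because the reader only ever moves forward through the archive, the same loop can consume a shard from a file, an HTTP response body, or any other byte stream.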

So we're trying to build out an ecosystem in which the storage infrastructure for the actual large data sets is off protocol. As a funny thing, I built a version of it that does work with PDS blobs, so I think I have a few versions of MNIST tested as PDS blobs on my own maxine.science. Then we use atproto PDS records as the index: the metadata and the schemas that give the interpretation of what each individual sample is in a given data set, plus lens transformations that allow you to move back and forth between schematizations of individual samples.

So you can aggregate across data sets that were made with different sample schemas but that are interconvertible. We have a version of this that's up on PyPI, and I'm pushing the Rust SDK for this right now, today. Thank you, Claude. But we're really excited to get community feedback and iterate on the way that we've designed this. As I was mentioning, the design of this overall is that it is centrally done through code. I think this is different from a lot of things that we see in the atproto ecosystem. Over time it'll be nice to have a web front end to allow people to browse data sets and things like that.

But really, we've envisioned this as something that you reference in your own Jupyter notebooks for doing data science, and something that your autonomous agents, on Claude Code or on Codex or what have you, can use to build things that query out to our app view. That app view aggregates all of the data sets that are posted in all the different fields: genomics, neural recordings of voltage dynamics in neurons, imaging data sets for medical imaging and biological imaging. All these different things can be posted and have entries as PDS records that give the metadata of the data set and what schema it's using. All of that is aggregated and filtered in real time, in order to provide people working in different fields with real-time streams of when new data gets posted.

And then specifically, we're building out a very similar trust-network architecture, inspired by the Bluesky verification system, so that in a given field there can be trust labels given to individual data providers. Those allow people to hone in on the data that's being provided by really reliable sources for particular fields. Then, when individual clients look out at what's present in the Atmosphere as far as records, you index on all the data that's out in the world, and the client has all the information it needs to go out to the storage mechanism and pull the streaming samples for that data.

The arrow is not depicted here, but we can also proxy that connection through the app view. There are a lot of different ways that we can plug all of these things together. But the overall vision is that the client is able to look out into the entire Atmosphere and say: what has science done in mouse brain MRI, or whatever it is? For any data set that has a lens that lets you convert it to something that I can interpret as mouse brain MRI, I see index records from the Atmosphere that tell me where to look for all of those data sets, and that let me do code generation from my own data science pipeline to ingest all of it, streaming from out in the larger web.
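One way to picture that query is as reachability over a graph of lenses. The sketch below is only an illustration under assumed data shapes: the index records, schema names, and the functions `convertible_to` and `matching_datasets` are hypothetical, not the app view's actual API.

```python
from collections import deque


def convertible_to(target, lens_edges):
    """Return every schema that reaches `target` through a chain of lenses.

    `lens_edges` is an iterable of (source_schema, destination_schema) pairs.
    The target itself is included, since no conversion is needed for it.
    """
    # Reverse adjacency: for each destination, the sources with a lens into it.
    incoming = {}
    for src, dst in lens_edges:
        incoming.setdefault(dst, set()).add(src)

    seen = {target}
    queue = deque([target])
    while queue:  # breadth-first search backwards from the target
        node = queue.popleft()
        for src in incoming.get(node, ()):
            if src not in seen:
                seen.add(src)
                queue.append(src)
    return seen


def matching_datasets(target, index_records, lens_edges):
    """Names of indexed data sets whose schema can be viewed as `target`."""
    usable = convertible_to(target, lens_edges)
    return [rec["name"] for rec in index_records if rec["schema"] in usable]


# Hypothetical example data.
LENSES = [("LabAMri", "MouseBrainMri"), ("LabBScan", "LabAMri")]
INDEX = [
    {"name": "lab-a-2024", "schema": "LabAMri"},
    {"name": "lab-b-pilot", "schema": "LabBScan"},
    {"name": "genomics-run", "schema": "RnaSeqCounts"},
]
print(matching_datasets("MouseBrainMri", INDEX, LENSES))
```

The second data set is found through a two-step chain of lenses, which is the case where keeping track of conversions pays off over matching on schema name alone.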

So we've built all of this not directly on Lexicon, as I mentioned. Lexicon is really specific to the way that atproto is set up, and we wanted to have our own schema system. So we have a lightweight way of doing this cross-platform. Right now it uses JSON Schema as an intermediary, but we want to support other things for that also. Essentially, when you have a data set that you're loading from out in the world, you want to be able to interpret, as each sample comes in, what the fields in it actually mean.

And so it was very important for us to publish abstractions of individual sample schemas as records, in the same way that individual lexicons are also published on protocol, so that everybody can have a consensus understanding of what this data actually means semantically.
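As a rough illustration of a sample schema published as a record, the sketch below wraps a small JSON Schema fragment in a record-like dictionary and checks incoming samples against it. The record type name and layout are invented for illustration, and the checker covers only `required` and flat `type` keywords rather than full JSON Schema.

```python
# Hypothetical record shape; not the actual lexicon used by the project.
SAMPLE_SCHEMA_RECORD = {
    "$type": "example.hypothetical.sampleSchema",
    "name": "LabeledImage",
    "schema": {
        "type": "object",
        "required": ["image", "label"],
        "properties": {
            "image": {"type": "string"},
            "label": {"type": "integer"},
        },
    },
}

_JSON_TYPES = {
    "string": str,
    "integer": int,
    "number": (int, float),
    "boolean": bool,
    "object": dict,
    "array": list,
}


def check_sample(sample, schema):
    """Return a list of problems; an empty list means the sample conforms."""
    errors = []
    for name in schema.get("required", []):
        if name not in sample:
            errors.append(f"missing field: {name}")
    for name, spec in schema.get("properties", {}).items():
        if name not in sample:
            continue
        value = sample[name]
        ok = isinstance(value, _JSON_TYPES[spec["type"]])
        # In Python, bool is a subclass of int; JSON treats them as distinct.
        if isinstance(value, bool) and spec["type"] in ("integer", "number"):
            ok = False
        if not ok:
            errors.append(f"field {name}: expected {spec['type']}")
    return errors
```

A consumer that has fetched the record can then run `check_sample(sample, SAMPLE_SCHEMA_RECORD["schema"])` on each streamed sample before handing it to a training loop.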

And then we built out a really simple interface for how it works when you're developing with this, based off of the Hugging Face Datasets API. Essentially, there's some magic underneath that resolves things very similarly to the way that Hugging Face labels on individual data sets are set up, but based off of the atproto handle that's associated with where the index entries are. Then it automatically does everything under the hood to build out a PyTorch data loader that does all of the streaming and lets you build out batched tensors in real time, or whatever data form you want to see it in.

So we're very excited about making it as simple as possible to go out and load an individual data set. We also have similar interfaces for doing queries based off of type conversion using our lens system: I want to find all the things of a particular schema, I want to find all the things that are convertible to a particular schema, et cetera. Under the hood, like I mentioned, this has been built foundationally on WebDataset, which is very, very simple.

It's just files; it's MessagePack in a tar file, essentially. It's very simple, but it's very powerful in being able to provide the streaming, and you can build a lot on top of it pretty easily with really little overhead. We use really simple decorators or macros to make the developer experience pretty simple for setting up schema types for whatever data you're working with, so that you can publish the index entry to your PDS with one function call and then publish the actual large files to FTP, S3-compatible storage, whatever you're using on your particular backend. Or PDS blobs, if you're nuts like me.
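The decorator style can be sketched in a few lines. This is an assumption-laden illustration, not the library's real decorator: the name `sample_schema`, the `REGISTRY` dictionary, and the Python-to-JSON type mapping (including treating `bytes` as a string field) are all hypothetical.

```python
import dataclasses

# Assumed mapping from Python annotations to JSON Schema type names.
_PY_TO_JSON = {str: "string", int: "integer", float: "number",
               bool: "boolean", bytes: "string"}

REGISTRY = {}  # hypothetical in-process registry of declared sample types


def sample_schema(cls):
    """Turn a plain annotated class into a dataclass with a derived schema."""
    cls = dataclasses.dataclass(cls)
    props = {f.name: {"type": _PY_TO_JSON[f.type]}
             for f in dataclasses.fields(cls)}
    cls.json_schema = {
        "type": "object",
        "required": list(props),
        "properties": props,
    }
    REGISTRY[cls.__name__] = cls
    return cls


@sample_schema
class MnistSample:
    pixels: bytes
    label: int
```

With something like this, the schema that would be published alongside the index entry is derived from the type definition, so the code and the published description cannot drift apart.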

We try to make that as interoperable as possible, and also to support a number of binary serialization formats to make things really convenient on that end. Right now I'm using NumPy byte serialization because it kind of works on the back end, but we're building that out to be really generalizable to support interop for the overall community. I hinted at this earlier, but I really wanted to dig into it, because I think it's a larger point that was really apparent to me at last year's conference.

And I got ADHD-distracted doing other stuff, like the stuff we're building at Forecast. But I do think that this is a very important point for the larger community, and particularly for the lexicon.community efforts at building out standards. What's really powerful about data that is schematized is not just that you can look at everything that's in your particular schema, but what you get if you actually bookkeep. For those nerds in the audience, I'm really deep into applied category theory, so there's a lot of abstract nonsense about why this is hella cool.

But essentially, the thing that matters more than the schema is the interconversions between schemas. If you can keep track of update and view operations between two different types, you can do even more than just understanding data that's provided to you in a given lexicon, say for atproto. You can query out and build tooling that automatically ingests things from any lexicon that has coherent view and update operations to a specified lexicon that you care about. In atdata, we're building this explicitly.
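The view and update pair can be written down directly. The following is a generic lens sketch in Python with made-up field names, offered as an illustration of the concept rather than as the project's lens records; the `Lens` class and the sample dictionaries are assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Generic, TypeVar

S = TypeVar("S")  # source type (for example, a lab-specific sample)
A = TypeVar("A")  # focus type (the schema you care about)


@dataclass(frozen=True)
class Lens(Generic[S, A]):
    view: Callable[[S], A]       # project a source value onto the focus
    update: Callable[[S, A], S]  # write a changed focus back into the source

    def then(self, other: "Lens") -> "Lens":
        """Compose: focus through `self`, then through `other`."""
        return Lens(
            view=lambda s: other.view(self.view(s)),
            update=lambda s, b: self.update(s, other.update(self.view(s), b)),
        )


# Hypothetical lab-specific sample -> generic neural recording.
lab_to_recording = Lens(
    view=lambda s: {"signal": s["voltage_trace"], "rate_hz": s["sampling_rate_hz"]},
    update=lambda s, a: {**s, "voltage_trace": a["signal"],
                         "sampling_rate_hz": a["rate_hz"]},
)

# Generic recording -> just its sampling rate.
recording_rate = Lens(
    view=lambda a: a["rate_hz"],
    update=lambda a, r: {**a, "rate_hz": r},
)

lab_sample = {
    "voltage_trace": [0.1, 0.4, -0.2],
    "sampling_rate_hz": 30000,
    "technician_wore_body_spray": True,  # lab-only metadata, ignored by view
}
```

"Coherent" here can be read as the usual lens laws: viewing after an update returns what was written, and updating with the current view changes nothing. Fields the focus does not mention, like the lab-only metadata above, survive a round trip untouched.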

We have PDS records in our lexicon that are about lens code that does interconversions between our atdata schemas. These are also subject to the verification system, so that we can have trusted lenses that are not doing arbitrary code injection and all the crazy nonsense that you'd probably have people do if you just had them reference arbitrary code that their Claude agent pulls, or their OpenClaw pulls, or whatever. But there's the application for atdata, which is really cool: being able to say, oh yeah, my lab published this schema and it's totally insane, because it's got this weird metadata about whether the lab technician wore Axe body spray that day, because that influences the mouse behavior in crazy ways that only our lab cares about.

You can just say: okay, it's neural recordings, I don't care. And you can just project that onto the schema that you care about. But I think this is similarly important for the larger atproto ecosystem, because we've seen a lot of discussion of what the right move is. Is it to have everybody make their own lexicon namespace for different applications, so that you have app-to-app separation? Or do we want to come together and have a lexicon.community thing, where everything of a specific type centralizes around an individual lexicon and you have some other way to specify what apps are putting into it?

And I think it's a why-not-both kind of situation. Individual apps can make their own lexicons for the types that they're working with. But if there is a centralized lens lexicon that is just the abstraction around what it is to interconvert between two atproto lexicons, then that enables developers to build tooling that says: okay, I want to make a blog app, and I want it to have this particular data type for posts with content that's specific to me, but I also want to be able to reason about Leaflet, and I also want to be able to reason about WhiteWind.

If you have the ability to define the lens transformations to each of those individual ones, you can automatically pull in and aggregate all of that data across many different, sometimes even kind of divergent, lexicons, as long as the lenses that give you the view and update operations to each one of them are coherent enough for what you want to do. So we're trying to build our own demo version of this in the way that atdata works with its own internal schematized sample types.

But I think this is a larger community point that could be really cool, and I wanted to spend a little bit of time hammering it home because I'm a big lens evangelist. With all of this put together, we're in the process of deploying our own app view, which is the canonical app view implementation for the atdata ecosystem. We're publishing all of our lexicons on science.alt.dataset as the reverse-DNS NSID.

So that was a cool move by four-years-ago me, to register alt.science. The overall vision is that we want to have something that allows people to have control over filtering feeds of data sets that they care about, but that can weakly cohere enough to facilitate interchange and collective sense-making for large scientific data sets. The next step, once we've built the atdata that we're prototyping right now in-house at Forecast, is to move from social data, being able to stream in data sets from all over the world across the entire Atmosphere, into social training: to go to the other side of the Hugging Face API, around model weights.

And so we're very excited about the possibilities there, of having lexicons that define the actual training phylogeny of model weights. So you can say: I want this to be associated with the hyperparameters that it was trained with, the data sets that it used as input, the weights that it started with when I was doing this fine-tune, the code that I actually used for doing it, and the evaluation metrics. And when lots of people are doing that and publishing the metadata, and then also, in a sidecar service, the weights themselves, we start building a situation where we can autonomously and collectively build out the space of all the different training trajectories that we can do for AI models, right?
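A training-phylogeny record along these lines might carry the fields listed above plus a pointer to its parent weights. The sketch below is purely illustrative: the `TrainingRecord` fields, the `at://example/...` identifiers, and the `lineage` helper are hypothetical, since no such lexicon is specified in the talk.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class TrainingRecord:
    uri: str                        # identifier of this set of weights
    parent_weights: Optional[str]   # weights this run started from, if any
    datasets: List[str] = field(default_factory=list)
    hyperparameters: Dict[str, float] = field(default_factory=dict)
    code_ref: str = ""              # e.g. a commit hash for the training code
    eval_metrics: Dict[str, float] = field(default_factory=dict)


def lineage(uri, records):
    """Walk parent links from `uri` back to the earliest published ancestor."""
    chain, seen = [], set()
    while uri is not None and uri not in seen:  # `seen` guards against cycles
        seen.add(uri)
        chain.append(uri)
        rec = records.get(uri)  # an ancestor may not have a published record
        uri = rec.parent_weights if rec else None
    return chain


# Hypothetical records keyed by URI.
RECORDS = {
    r.uri: r
    for r in [
        TrainingRecord("at://example/base", None, ["at://example/ds-a"]),
        TrainingRecord("at://example/ft1", "at://example/base",
                       ["at://example/ds-b"], {"lr": 1e-4}),
        TrainingRecord("at://example/ft2", "at://example/ft1",
                       ["at://example/ds-c"], {"lr": 5e-5},
                       eval_metrics={"accuracy": 0.91}),
    ]
}
print(lineage("at://example/ft2", RECORDS))
```

Once many teams publish records like this, the parent links form a tree (or graph) of training runs that anyone can traverse to see which data, code, and starting weights produced a given model.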

It allows this weak cohesion of all of the different biotech startups, language model startups, and people working on small teams. That weak coherence between them really amplifies how much we can do as a collective across search spaces that are sometimes astronomical. And I think it can have a transformative impact on the way that distributed AI work in smaller independent teams is able to catch up with, compete with, and even outpace some of what's possible in larger labs. So the summary is that, especially with the semantic concentration that's present in the AI field, and with the power consolidation that is happening inside of the AI field with the large labs, there is an alternative thesis of what the future of AI model training looks like.

It's not one lab with a giant cluster. It's instead the community, with their own independent hardware and their own independent resources, and with enough coherence, because of the infrastructure and because of the protocol, to build synergistically off of each other's work. So this is the next phase of what we're building after atdata at Forecast, so stay tuned. And that gives a full picture of an ecosystem that's built at every level: the data sets, the way those data sets are able to interoperate, the training of the models, and the actual publication of the weights. That builds out a larger ecosystem where we can trace where things originated from and what code was used to generate them, and share all of those details to build off of one another.

Oh no, the AI for my slides made a goof. Oh well, that's how it goes sometimes. Just to wrap up on the final thing: as I mentioned at the very beginning, Forecast is centrally built around our vision of providing people with cognitive agency. In our business, we are focused on drug development and discovery as a means to provide that, because of our research experience. I'm a neuroscientist by training. I did my PhD in neuroscience down the street from here at UCSF, and I also did some medical training there. I care deeply about neurology and psychiatry, and this is our focus.

But I think that overall thesis has a direct link with some of the trends that we're seeing in AI models. The overall shape of the AI ecosystem is moving us toward the possibility of a future with a lot of homogenization. That's very unhealthy for individuals' mental health and for the health of our scientific discourse, which benefits enormously from noise, from the divergence of opinions that comes from people not all seeing the same semantics coming through the model, but their own situated, individual versions of it, which come in a distributed form from individuals' experiences and the particular things that they bring to the table.

And so all of this is shaped fundamentally by the tools that we have, the infrastructure that we have for how the systems we're building are created and how they interoperate. I think that atproto has an incredible possibility, not just on the side of creating a decentralized approach to our social media data, but also a decentralized approach that, in a similar way, is incredibly empowering to groups that are producing artificial intelligence. I think this has outsized potential to transform the future of the way that industry looks.

I believe fully in the promise of what is provided by the protocol work that's here, and I believe in protocols, not labs, for the future of AI development. So that wraps up everything that I have about atdata and what we're working on in distributed AI at Forecast. And also, we're hiring. So if you like solving problems, let me know.

Thank you so much, Maxine. Can we get a round of applause for Maxine? Thanks so much. I've got two mics on right now, one for the stream, one for Maxine.

Maxine, can you hear me all right?

Yeah, perfectly.

Fantastic. We've got about five minutes for questions, and I'm going to take the prerogative of the moderator to first say: Maxine, you should check out what Nick Gerakines is working on right now along with Blaine Cook, because Nick has actually just implemented lensing in Lexicon Garden. So you can lens between literally any arbitrary lexicon, and you guys can just skip right to the next step if you'd like. Blaine Cook also talked about this yesterday in an unconference session, and Blaine and Aaron Stephen White have a much crazier arbitrary data lensing system that they're also building.

So, all should be.

Oh, I'm so excited. Yeah, we talked about that last year, and it's so cool that that's happening. Amazing.

Yeah, so it's very cool. So, yes, any other questions that folks have? Everybody's minds are blown. Okay, yeah. Come on away.

I'm interested to hear how you'll progress as far as documentation for how the public could get involved with the things you're working on.

Oh, yeah. I mean, we have so many things. We're a very early-stage startup.

We're running around, and there are very few of us to do a ton of tasks. So I'm very excited about what has happened with Claude over the last few months as far as providing ways to get documentation out there. Right now I've been doing a lot of build-out on our AI tooling, which is called crosslink, which some people in the community might have seen, so that's been a little bit of the focus. But I'm definitely trying to move back over to build out some of the documentation on how to get started with atdata, particularly for people who are using AI agents for coding.

One of the cool things that we've tried to do in crosslink is make it so that there's a knowledge repository per Git repository, specifically for coding agents, that they can pull up from repo to repo. So our crosslink knowledge tool is something to check out for that. My hope is that I can build out some of the knowledge that's specific to atdata in its own separate orphan branch on Git, and that your agent can then pull from that in order to know how to do all the things, basically.

But no, it's a very good point that the docs are extremely important.

Yeah. Cool. Any other questions? Okay. Well, then we'll just say thank you again, Maxine. Really appreciate you being here remotely with us.

Amazing. Thank you so much for having me, and have a fantastic conference.