OpenAI IMO Gold, AWS S3 Vectors, MCP Server Exposés, Sovereign Clouds & the Capex Surge
Hi, everyone. Welcome to the Monkey Patching podcast, where we go bananas about all things Math Olympiad, MCP, and more. My name is Murilo, joined as always by my psychic. I don't know about psychic, but my cohost.
Bart:Beautiful cohost.
Murilo:Beautiful, gorgeous cohost Bart. Hey, Bart. Hey, Milo. How are you?
Bart:I'm doing good. Doing good.
Murilo:Doing good as well. Actually, we tested out what we discussed last episode, about
Bart:ChatGPT agent mode, and it got released this morning. I got it something like an hour before we went live and tested it out.
Murilo:You tested it out already. Any anything you can share already? Or
Bart:So what I tested it with — it's very limited, of course. I have this pitch deck, basically a PowerPoint presentation, and I asked it to improve it. So a very vague prompt. Right? It took, like, half an hour, like, a shitty long time.
Bart:The outcome was not great, but it did create a new slideshow.
Murilo:It did create a new one?
Bart:It did create a new slideshow.
Murilo:Was it— because I feel like for Gemini, I've asked it to create slides, but half the slide was bullet points, and the other half was empty.
Bart:It looked quite good. But, of course, it started from an existing slideshow, and it creates something proprietary, I think, something JavaScript based. And it has a viewer now in ChatGPT to basically show a slideshow. But it does a lot of web browsing. You see a bit of the process. I'm not sure if it's actually working like that, but it looks like it starts up a virtual machine.
Bart:You can follow along what it's doing. So it's generating code. It's looking up references online on how to create a good pitch deck, these kinds of things. I was not super impressed with the actual output, but my prompt was also not very targeted or informative, of course. But it gave me a bit of the feeling of how I work with Claude Code, where you can actually say, look this up, improve these documents, and then it starts doing it for you, and
Murilo:you can follow the progress.
Bart:But this is a bit more accessible, like, through the chat user interface. And you still see a sort of representation of what is happening in this virtual machine, which also helps you understand a bit of the thought process.
Murilo:Yeah. So thanks.
Bart:Well, I'm interested to test a bit more with it. I think it's 40 requests per cycle — I think it's a monthly cycle. So I just used one.
Murilo:Just one? But—
Bart:it took me half an hour or so. Yeah. But I need a lot of those in a day.
Murilo:This agent is more generic, right? Like, it's not just for PowerPoint slides. You can also—
Bart:It's very generic. Very generic.
Murilo:Very generic. Right? Yeah. Yeah.
Bart:And I don't actually know if it has access to MCP, because then — there is an MCP server for Google Slides or something, then you could actually have more
Murilo:Yeah. Yeah. Yeah.
Bart:Could actually build Google Slides for me.
Murilo:Yeah. And the yeah. I saw that as well. I think it's gonna kill a lot of startups as well.
Bart:Potentially. Potentially. Yeah. Yeah. Potentially.
Bart:Yeah.
Murilo:Which is actually not that new either. Even when ChatGPT started, there was the whole "every time they release a new feature, they kill a couple hundred startups" thing. So
Bart:Yeah. I think you need to be very strong in the domain that you're in, have very strong domain knowledge. Otherwise, you can just be bypassed by an extension on ChatGPT.
Murilo:Yeah. Indeed. Indeed. Yeah. Cool.
Murilo:I've been using Claude Code more. I'm really trying to use it more. It's fun. But sometimes you can tell it veers off. Like, sometimes I tell it one thing, and then I say, okay.
Murilo:Let's just make sure it works. I'm gonna delete all the saved files and all these things, the databases, and rerun it. And then I notice it's not working. Or then I actually look at the code, and the previous prompts it kind of forgot.
Murilo:Like, it ignored them. Like, I said I wanted to do this, this, and this, and then it just kind of found a shortcut. And then it did it, and it worked. But then when I look, it's like, no. No. No.
Murilo:That's not what I want. Yeah. You need to
Bart:find a bit of a workflow either way.
Murilo:Yeah. Indeed. But even if I have to go more specific — I mean, I can still use AI on Cursor or something. But even if I have to go for Cursor now, I still feel like 80% you can just do with Claude. Like, all the, you know, just doing this and this.
Murilo:And so I'm figuring out a workflow here that works for me. It's been it's been fun. Alright.
Bart:May maybe before we jump into the news of the week, we have some holidays coming up. So for the next three weeks, we're gonna see a little bit how we do things, but we're gonna get back to that. Right?
Murilo:Yes.
Bart:Yes. There will be some updates in the coming three weeks, but maybe a slightly different format.
Murilo:We'll keep you posted. Alright. Then what we have for this week: we have Marimo. So Marimo is rolling out MoLab, a free cloud workspace where users can spin up and share its reactive Python notebooks from any browser. MoLab is a cloud-hosted Marimo notebook workspace that lets you rapidly experiment on data using Python and SQL, the team notes.
Murilo:And it's available at no cost. So Marimo, I think I I'm not sure if we talked about Marimo before.
Bart:I'm not sure if we did it on the podcast, but you and me definitely talk about it.
Murilo:Yeah. We definitely talked about it. Basically, Marimo is, let's say, Jupyter notebooks on steroids. Version two. Yeah.
Murilo:Steroids. They really rethought how notebooks work and what the things are they don't like. So for example, cells — if you have cell B that depends on cell A and cell A updates its value, cell B will also rerun. So they basically have everything reactive. You can also set up the cells in a grid format, and you can have plots, and you can have sliders on a plot.
Murilo:And when you update that value, then everything else updates. So you can create dashboards as well. I've seen some people saying that it can replace things like Streamlit. And now they've released MoLab — Mo is from Marimo, I think. Yeah.
Murilo:Mo for Marimo. And the lab is like Google Colab, or like laboratory. Superlab. Indeed.
Murilo:Indeed. And actually, they make a lot of analogies with Google Colab.
Murilo:Right? So basically, they want to give a very simple experience for people to try it out. You can still edit it very easily. You just have the URL, molab.marimo.io/notebook/ blah blah blah, and you can edit it locally.
Murilo:But then if you don't want to have the whole setup, you can just very easily go to MoLab and interact with all these things there. I think it's a cool little release, and maybe for people that are learning Python and want to try an easy setup without downloading everything, all the scaffolding and all these things — maybe that's something you can do. One thing, though: Marimo did have a, how do you say, Wasm version. So I could actually already share a Wasm version of my notebook, but I guess there were a lot of limitations in terms of dependencies, in terms of size of the notebook.
Murilo:Right? And I think this also doesn't have that. So yeah. What do you think? Have you used Marimo before, Bart?
Bart:Not properly. Like, I played around with it, but not for actual projects. But what I like especially is — well, it looks better than notebooks, but the reactivity really is a big, big step up. For people that don't use notebooks that much: say you have a calculation somewhere depending on a variable x, and x is equal to two, but that x value is defined in a cell before the current one. If you update x, your current cell is not updated. But in Marimo, if you have something depending on x and you change x, everything updates automatically.
Bart:So you're always more or less, quote, unquote, up to date. And the UI looks a lot better,
Murilo:I think. Also, if you commit Marimo notebooks, in the end it's just Python functions. Like, it's always a function with a decorator. So nice to
Bart:Didn't know.
Murilo:Commit stuff. Yeah. If you want to commit stuff, it's easier. Right?
Murilo:Like, you don't have the metadata there. And because everything's reactive, you don't have to make sure, like, the outputs are always gonna be Yeah. Reliable.
Bart:And committing notebooks is hard because it's JSON in the end. I mean, JSON is hard to compare.
Murilo:There's a lot of metadata. A lot of metadata.
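A toy sketch of the reactivity idea discussed above — a dependency graph where downstream cells rerun automatically when an upstream value changes. This is an illustration of the concept only, not Marimo's actual implementation; all names are made up.

```python
# Toy sketch of notebook reactivity: when a cell's value changes,
# every cell that depends on it reruns automatically.
class ReactiveCell:
    def __init__(self, compute, deps=()):
        self.compute = compute      # function producing this cell's value
        self.deps = list(deps)      # cells this cell reads from
        self.dependents = []        # cells that read from this cell
        for d in self.deps:
            d.dependents.append(self)
        self.value = None
        self.run()

    def run(self):
        self.value = self.compute(*[d.value for d in self.deps])
        for dep in self.dependents:  # cascade: rerun everything downstream
            dep.run()

    def set(self, value):
        self.compute = lambda *_: value
        self.run()

# "cell A" holds a value; "cell B" depends on it
a = ReactiveCell(lambda: 2)
b = ReactiveCell(lambda x: x * 10, deps=[a])   # b.value is now 20
a.set(5)                                       # updating A reruns B: b.value is now 50
```

In a classic Jupyter notebook, changing `a` would leave `b` stale until you manually reran its cell; the cascade in `run()` is what removes that failure mode.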
Bart:Yeah, interesting to see where this cloud offering goes. What I see is this is a bit their commercial offering. Right? Like, they have an open source notebook environment, but this is their commercial offering for it. I think it's good of them to start from a really strong notebook offering and then build a commercial offering on top, because we've seen a lot of commercial offerings around more or less rehashed Jupyter notebooks, right, in the past.
Bart:Yeah. For sure. We haven't really seen one of them being super successful. And they actually raised — I think they raised around 5,000,000 not that long ago, end of twenty twenty four. I think this is probably one of the results coming from that.
Bart:I looked a little bit into the discussions on this. There are still some open questions. If you look at the blog post, notebooks are public by default. You just share a link, which sounds a bit weird. It sounds a bit like security by obscurity.
Bart:Like, there is access if you have the link. Yeah. But this is the very first version, and that stuff needs to be hashed out. It's good to have a new promising notebook environment. And especially — I haven't tested this — they state that it's also AI first.
Bart:Current notebook environments don't really integrate very well with AI-assisted tooling. This might be a game changer if you start something from scratch and rethink the notebook. Yeah.
Murilo:I also think
Bart:We'll see how it fits in with AI.
Murilo:They start appealing more to beginners as well, people that just want to give this a try. They lowered the barrier, so I think it's gonna be fun. The only thing that I didn't like about Marimo — and this is about Marimo, not MoLab — is that back then the VS Code integration wasn't super mature, let's say. And that meant that I couldn't use my favorite AI friend. So I was like, okay, I need to be in the browser. And at the time, the
Murilo:use case was like, okay: I'm using notebooks, but it's not very experimental. I kinda know what I wanna do. I just wanted to build a report, and I wanted to have the AI stuff. So this was a while ago. They probably changed this by now.
Murilo:But in any case, I think it's really good to see more diversity. Right? A lot of people hate notebooks as well, and I think they're trying to do something about it. So that's cool. That's cool.
Murilo:Alright. What else do we have?
Bart:OpenAI researcher Alexander Wei reveals an experimental model that solved five of six twenty twenty five International Math Olympiad problems, good enough for a human-level gold medal. He stresses that progress here calls for going beyond the reinforcement learning paradigm of clear-cut verifiable rewards, hinting at fresh reinforcement learning and scaling tricks. So this is the very first time that an AI model got a gold-level performance at the Math Olympiad, which is special in its own right.
Murilo:Congrats to the AI. Sorry? Congrats to the AI.
Bart:Congrats to the AI. Apparently, there are six questions during this session. It was able to solve five of them. It's using a new, quote, unquote, advanced version of their reasoning model. It's not gonna be public.
Bart:At least, the model that they're using is not, for the coming time. Also interesting is that since, I think, this morning or yesterday evening, DeepMind also announced that they also participated, and they got exactly the same result
Murilo:Wow.
Bart:In terms of scores. They were also able to solve five of six problems. And there's also a little bit of a kerfuffle on how it got announced. DeepMind apparently waited for the certification to be formal, but also to have the actual humans get a bit of a spotlight before announcing that their AI model won something.
Bart:This was apparently also asked by the organization — to wait for the certification of the results — but OpenAI already announced it, like, I don't know, three, four days ago, something like that, before certification actually happened.
Murilo:Okay.
Bart:So there's a bit of
Murilo:So OpenAI didn't care about the request. They just did it. But then when you say, like, to let the people get a bit of spotlight, you mean the competitors?
Bart:Yeah. The actual ones. Yeah.
Murilo:They didn't want to discourage them. But OpenAI didn't listen. They just did it.
Bart:That's what it seems like. Yeah. That's what the kerfuffle in the community is about.
Murilo:And Google — did Google wait as well?
Bart:Google waited. Well, it's the DeepMind guys. Right?
Murilo:Yeah. DeepMind. But I think — I also skimmed through it — you also needed to submit, like, the proofs. Right?
Murilo:So it's not just giving the answer, and if it's hallucinating or something, it's fine. It's like a human. Right? You have to explain all the steps, why this is this and this and that. So it's
Bart:So maybe to add — and I think OpenAI follows a similar approach — they do this end to end this year, in natural language, where last year they tried to convert the assignments to something that is more easily readable by a system. For example, DeepMind last year — I think they developed Lean, which is a language to express these problems in. Not sure if they developed it or just translated the problems into it, but then they tried to solve those Lean translations. But here, the approach that both OpenAI and DeepMind have taken is that they really start from the English problem and end up with the actual answer.
Murilo:And what is the language, the intermediate language, that you
Bart:mentioned? Well, that's what they tried last year. They're not doing that anymore this year.
Murilo:But it's like a different language that is easier for the machine to understand? Or what
Bart:is it? Yeah. Yeah.
Murilo:But is it English still? But it's just more detailed, or is it like completely different?
Bart:No. No. It's completely different.
Murilo:Like, I think it's something to express these problems in symbolic terms. I see. I see.
Bart:I think what DeepMind used last year is called Lean, but I'm not sure what OpenAI did.
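For a flavor of what "expressing problems in symbolic terms" means: in Lean, a mathematical statement and its proof are both code, so a machine can check every step. A trivial toy example (nothing like an actual IMO problem):

```lean
-- The proposition (addition of naturals is commutative) and its proof
-- are both written in Lean; the compiler verifies the proof.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```

The point of translating Olympiad problems into such a language, as DeepMind did last year, is that a proof either checks or it doesn't; there is no room for hallucination.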
Murilo:Very cool. Does that make you feel like the AGI is coming? The general intelligence, the true AI is coming, Bart?
Bart:Well, there's a whole lot of discussion, but it is a big one, because AI has never been able to get a gold-level performance at this. DeepMind has been hinting that they developed Deep Think for this and that they will migrate it to mainstream models at some point. Whether or not this means anything for AGI, I don't know. Whether or not this means we're becoming a lot better at solving these types of problems through AI — I think that is a reality. Right?
Murilo:Yeah. That's true. But I also think, like, what does it mean for the general person? I think, okay. It's you have a model that has better reasoning, at least for math.
Murilo:Right? That's the thing also, like, is it narrow? Is it better narrow, like, at a scope? Or I don't know. I feel like sometimes it's hard for me to like, I see these things, and I think it's really cool.
Murilo:Like, you see that the it's moving forward. Right?
Bart:Mhmm. I think that is the that is the takeaway here.
Murilo:How's this gonna get to me, you know, in a few, you know, in a few months? Like, how am
Bart:I gonna see this? I think, for us: we shouldn't participate in the Math Olympiad anymore.
Murilo:I wasn't planning on it. People are also saying, like, oh, it's cool that it can do this — but most people cannot get a gold medal in the Math Olympiad either. Right? It's the same.
Bart:What I'm interested in is the development cost, but also the actual run cost for this. And both OpenAI and DeepMind are not disclosing anything on it. I think the run cost — the inference cost — for this was probably huge.
Murilo:Yeah. Probably. Probably. And probably this is not gonna be open to every ChatGPT friend. Right?
Murilo:It's probably gonna be for the best friends.
Bart:Well, for now, yes. But I think probably two years from now, this is gonna be easy peasy.
Murilo:True. True. Yeah. That's true. I do feel like a lot of times when you see something like this and then you see that it's possible, then you see a lot of people making it accessible.
Murilo:And so indeed, I agree with you. I agree with you. Should we move on to the next one?
Bart:Yes. What do we have?
Murilo:We have Knostic. Researchers scanned the web and found that thousands of Model Context Protocol servers are live and wide open, leaking tool inventories to anyone who asks. And I quote, we identified a total of 1,862 MCP servers exposed to the Internet, and every audited sample allowed unauthenticated access. I also took a look at this. So, basically, what they did, as I understood it, is some smart searching to see which servers actually had MCP servers.
Murilo:So they looked for keywords or routes, and they somehow fingerprinted this stuff to track what was out there. And then, basically, they found that a lot of stuff is just open. They also mentioned that the MCP specification was not really focused on the authentication part, and there are a lot of MCP servers basically out there in the open. So big security risks for a lot of people.
Murilo:Right?
Bart:Yeah. I think that's a bit the alert they're raising — that with a general search of the, quote, unquote, worldwide web, probably covering only a very small portion of it, they found a lot of open MCP servers. Basically, by searching — they use Shodan, which is a specific search engine for this. They also open sourced some material, so you can reproduce this to some extent. A lot of these MCP endpoints are easy to find because they often follow the defaults.
Bart:So let's say you use the MCP SDKs — you get an automatic endpoint on slash MCP, for example. So if you scan port 80 for slash MCP, it's probably an MCP server. And then you try to connect to it, and you get some response back, a JSON-RPC message, which is typically used by MCP. And one plus one in that case basically means you're talking to an MCP server.
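The probe step described here can be sketched roughly like this — a hedged illustration, not the researchers' actual tooling. The `/mcp` path and the `initialize` payload follow common MCP conventions, but treat the details as assumptions.

```python
# Hedged sketch: probe a host for a default MCP endpoint and check
# whether the reply looks like a JSON-RPC message.
import json
import urllib.request

def looks_like_mcp(base_url, timeout=5):
    """Return True if base_url + '/mcp' answers with a JSON-RPC body."""
    req = urllib.request.Request(
        base_url.rstrip("/") + "/mcp",
        data=json.dumps({
            "jsonrpc": "2.0",
            "id": 1,
            "method": "initialize",
            "params": {
                "protocolVersion": "2024-11-05",  # assumed version string
                "capabilities": {},
                "clientInfo": {"name": "probe", "version": "0.1"},
            },
        }).encode(),
        headers={"Content-Type": "application/json"},
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            body = json.loads(resp.read().decode())
    except Exception:
        return False          # unreachable, non-HTTP, or non-JSON reply
    # JSON-RPC responses echo the protocol marker back
    return body.get("jsonrpc") == "2.0"
```

If such a probe succeeds without any credentials, the server is exactly the kind of unauthenticated endpoint the report is warning about.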
Bart:Or an MCP server. I think it's good that they're raising this concern. What they're mentioning is that the MCP protocol itself doesn't have anything around security. It doesn't have anything around access management. There is no authentication in place.
Bart:It is just a protocol to talk between a client and a server. It's as simple as that. I think at some point it would be good if there's an extension to the protocol that, by default, adds something around authentication. And I understand why they didn't. Right?
Bart:Like, you want to optimize. You want to be very good at this communication between client and server. But at the same time, what you see is you get a lot of people using this who maybe are not experts on this. Yeah. And then being very opinionated: how do I do authentication?
Bart:How do I do security around this? Because you basically need to bring that yourself, and that becomes hard if you are not an expert on it. At the same time, while I think the original alert is good — to be honest, these, what was it, 1,800 or something MCP servers?
Murilo:1,862.
Bart:Yeah. If they are not secured, it's a risk. I mean, you risk leaking some internal APIs or leaking internal data. But at the same time, of these 1,862, probably 1,800 are just, like, projects that people do themselves to get some stars on GitHub. Right?
Murilo:No. Yeah. True.
Bart:It's very easy these days to just, like, have a very simple prototype play around with it, deploy for free on whatever service. And I think a lot of these that they find are probably something like that.
Murilo:That's true.
Bart:At the same time, even though auth is not simple, any serious company that deploys an MCP server is gonna think of this. Right?
Murilo:Yeah. I would be surprised if they didn't. Right? Exposing stuff like this.
Bart:Yeah. It's a bit like saying, oh, we found a FastAPI server, but there is no authentication on it. And then, I mean, the problem is not FastAPI. Right?
Bart:It's the person implementing this.
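The "bring your own auth" point can be made concrete with a minimal sketch: a bearer-token check placed in front of the handler. The token value, header handling, and `handle_request` shape are illustrative assumptions — the MCP spec itself mandates none of this.

```python
# Minimal sketch of wrapping an endpoint with bearer-token auth.
import hmac
import json

EXPECTED_TOKEN = "replace-with-a-real-secret"   # illustrative placeholder

def authorized(headers):
    """Check a Bearer token before a request reaches the real handler."""
    auth = headers.get("Authorization", "")
    if not auth.startswith("Bearer "):
        return False
    # constant-time comparison to avoid timing leaks
    return hmac.compare_digest(auth[len("Bearer "):], EXPECTED_TOKEN)

def handle_request(headers, body):
    """Gatekeeper: reject unauthenticated calls, else pass through."""
    if not authorized(headers):
        return 401, json.dumps({"error": "unauthorized"})
    return 200, body   # hand off to the actual MCP handler here
```

Even this small amount of wrapping is what many of the exposed servers in the scan were missing.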
Murilo:Yeah. That's true. When I was reading this, it echoed a bit what you said, I think, last week or maybe two weeks ago: a lot of people using these things don't necessarily care about privacy. Maybe that's the difference. Right?
Murilo:I feel like this is so — how do I say it? It's so in everyone's face. Mhmm. People that don't necessarily think of security first just jump on it. That's the only way I can see it.
Murilo:Like, maybe there's a group of engineers that just wanted to move fast. They just did things without thinking it through or without consulting with — I don't know.
Bart:Surely. It's possible. Yeah.
Murilo:It's possible. But but I but I agree. I think
Bart:I think the big incentive you have as a company or a developer deploying this — an MCP server versus a web API — is: if someone abuses your MCP server, you rack up a big bill in tokens. Right?
Murilo:Yeah. I think that's the
Bart:a very good incentive to not just let everybody access it.
Murilo:But I think that's probably the biggest even if you don't care about the security, I think that's even just that. Right?
Bart:Like Exactly. Exactly. Save
Murilo:up a few costs there. Yeah. That's true. That's true. And maybe question for you.
Murilo:After MCP came out, I heard a lot of criticism about the protocol. But to be honest, I'm kinda going along for the ride, and I don't fully understand. I feel like some people are very opinionated: MCP is a bad protocol, it's badly written, it's badly motivated.
Murilo:And I was like, okay. I kinda understand some of the things you say, but I don't I don't know. Like, it's everyone's using MCP. Right?
Bart:Yeah. It's actually one of the news items we have. But I think on every protocol, you have people that say, this could be better. That could be better. I've implemented it three times
Murilo:now myself, something like that.
Bart:I think it's easy to understand, and that has helped a lot. There are some edge cases in the protocol that in practical terms are not being used a lot. There are things missing where you need to bring your own solution — like authentication, like these things. But what we've seen is that there has been a huge adoption of MCP, right?
Murilo:Yeah. It's true. I feel like there are a few more protocols. I even heard of ACP the other day, and there's also agent-to-agent.
Bart:It's a bit of a different thing. But I
Murilo:still feel like MCP is the most popular one by far, by far. So I don't know. All criticisms aside, yeah, still the most popular. Everyone's using it. Seems
Bart:Especially for, basically, more or less tool use. Right? Agent-to-agent is a little bit of a different focus. It's like agent-to-agent communication. Yeah.
Murilo:Yeah. Indeed.
Bart:And these agents can still be using MCP under the hood. But maybe a good segue here — yes — is that AWS has opened a developer preview of an MCP server that lets foundation models turn plain-English requests into live AWS CLI calls. I quote, the server provides secure access control through AWS Identity and Access Management credentials and preconfigured API permissions, ensuring agents operate within strict guardrails. So this is cool news, actually.
Bart:This was published a few days ago, and AWS basically has their own MCP server that you can, I think, run locally. And it allows you, based on just English — or, let's say, natural language — to generate interactions with their CLI. And whether or not it's a good idea from a maintenance perspective, you could do stuff like: please create this bucket for me with these permissions in that region. And then maybe you'd quickly get into the discussion of whether you need infrastructure as code.
Bart:But let's not go into that discussion for now. You can also do stuff like: oh, I see that this container is failing, can you provide me the logs? So these types of interactions become very, very intuitive. Right?
Murilo:I think, I mean, anything that is read-only, right — read what is available, read this, read that — that's a bit of a no-brainer.
Murilo:Yeah. Because if
Bart:you can very easily switch — where before, you would read the logs, see, oh, something is wrong with this resource, and then you need to go to a different view in the web UI. But here, you can just do it in natural language, to basically gather information and understand what is going wrong and potentially fix it.
Murilo:Yeah. I mean, again, I think it's a very good idea. I wouldn't go as far as creating stuff yet, though. I feel like with creating stuff, I'd still be a bit more careful.
Murilo:But if it's reading stuff, yeah, read away. If you're gonna write code, you're gonna ask the agent to prepare something like this, it's good to know what kind of buckets you have, what the structure is, what services you already have up, what the logs are if something failed. Right? Giving more information is never bad.
Murilo:So
Bart:And, apparently — from someone that actually tested it — it's also very good at generating syntactically correct commands. So it has some guardrails in place. I haven't looked at the code, so I don't know how they do it, but they probably have something like: this is the valid schema for CLI interactions.
Bart:So it only gives back actual working commands, which is a nice thing. Right?
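A guardrail of the kind Bart speculates about might look roughly like this — a hedged sketch where the allowlist and the read-only policy are invented for illustration, not AWS's actual implementation:

```python
# Sketch: validate a model-generated CLI command against an allowlist
# of read-only (service, operation) pairs before ever executing it.
import shlex

READ_ONLY_ALLOWLIST = {          # illustrative, not AWS's real policy
    ("s3api", "list-buckets"),
    ("logs", "get-log-events"),
    ("ecs", "describe-tasks"),
}

def validate_command(command):
    """Accept only `aws <service> <operation>` pairs on the allowlist."""
    parts = shlex.split(command)
    if len(parts) < 3 or parts[0] != "aws":
        return False
    return (parts[1], parts[2]) in READ_ONLY_ALLOWLIST
```

With a check like this, `aws logs get-log-events ...` passes, while a destructive `aws s3api delete-bucket ...` is rejected before it ever reaches the shell — which matches the cautious "read-only first" stance discussed next.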
Murilo:Yeah. It's already a good step forward. One thing I saw — I thought this was gonna be like that, but it's different — is that they also have a very simple Lambda wrapper for MCP servers now.
Bart:So you deploy an MCP server there, you mean?
Murilo:Yeah. In a Lambda. So it's very, very easy. Lambda functions in AWS are serverless functions — for people that are not in the know, let's say. Basically, you have a machine that is off, and as soon as you make a request, the machine turns on and just executes what you want,
Bart:then it turns off again. Without authentication.
Murilo:Without authentication. But sometimes you need the AWS auth stuff. No?
Bart:Like, to
Murilo:use the Lambda. So the authentication is on the Lambda side, not on the MCP side. Right? Yeah. So they made it easy — basically, if you wanna create a quick MCP server there, to use these Lambda functions that kind of encapsulate everything.
Murilo:So I also saw some things like that there. Pretty useful application.
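A Lambda wrapper of the kind Murilo mentions could be sketched like this — the event shape, the tool dispatch, and all names are illustrative assumptions, not AWS's actual adapter. The idea is just: the function wakes per request, forwards the JSON-RPC body to an MCP-style handler, and returns the result.

```python
# Sketch of wrapping an MCP-style tool call in a Lambda handler.
import json

def mcp_tool(name, arguments):
    """Stand-in for the actual MCP tool dispatch."""
    if name == "echo":
        return {"content": arguments.get("text", "")}
    raise ValueError(f"unknown tool: {name}")

def lambda_handler(event, context=None):
    rpc = json.loads(event["body"])
    try:
        result = mcp_tool(rpc["params"]["name"], rpc["params"]["arguments"])
        payload = {"jsonrpc": "2.0", "id": rpc["id"], "result": result}
    except Exception as exc:
        # JSON-RPC reports errors in-band, so the HTTP status stays 200
        payload = {"jsonrpc": "2.0", "id": rpc.get("id"),
                   "error": {"code": -32000, "message": str(exc)}}
    return {"statusCode": 200, "body": json.dumps(payload)}
```

As discussed above, the authentication then lives on the Lambda/IAM side, while the function body only speaks the protocol.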
Bart:Cool. But, I mean, this is another argument to say MCP is very, very quickly becoming the standard. Right? Indeed. If big players like AWS choose to make the effort to expose their whole CLI through MCP, I mean, that's a big signal.
Bart:Right? Indeed. And I think,
Murilo:like I said, agent-to-agent does exist and has existed for a while now, but for a while, it didn't really catch on much. I remember when I looked into agent-to-agent versus MCP — and this was back when agent-to-agent was announced — people were saying, yeah, Google announced this protocol, but no one's implementing it yet. So you couldn't do anything with it. It was like, okay, cool idea, but it wasn't really something that was impacting you much.
Murilo:And I feel like MCP is really everywhere.
Bart:I think agent-to-agent is also a use case that we see less for a developer developing on their laptop. True. Maybe, like, if you're doing something in Claude and you have parallel agents running, maybe they even use agent-to-agent between them. True. But I think MCP is very close to what everybody does, because it integrates natively in your ChatGPT chat window.
Bart:Right?
Murilo:Yeah. But the thing is, MCP is flexible enough that you could — I mean, it's a different architecture, I understand — but you could also use MCP to have two agents talk to each other, like, if one agent is a tool of another agent. Right?
Bart:Yeah. Yeah.
Murilo:Yeah. So it's not like, if you need agents communicating with each other, you need to have agent-to-agent. And that's why I want to actually try agent-to-agent a bit more: to see what the use cases are where you definitely use agent-to-agent, what the cases are where you can use both but maybe one is better because the other is a bit of an awkward setup, and what the use cases are where only MCP can go. But again, back then when I looked into it, it wasn't very widely implemented.
Murilo:But since then, I heard in the past weeks that it is way more mature. Like, there are a lot more places that implement agent-to-agent. So I'm very curious about that. Shall we move on to the next topic? Yes.
Murilo:Alright. So we have Paul Kedrosky arguing that runaway AI data center spending is now large enough to move US macro stats and rival the railroad boom of the eighteen hundreds. He estimates that, and I quote, AI capex may be 2% of US GDP in 2025, effectively adding nearly a full percentage point to growth on its own. I think what this is alluding to, and I'll look at you a bit because I think you're more qualified to talk about this, is how much is being spent on AI data centers this year. And again, like we mentioned, reasoning models probably use a lot of energy to train, but also to do inference.
Murilo:They see that a lot a lot a lot a lot of money is being invested on AI data centers in The US. Right?
Bart:Yeah. So the estimate is that up to 2% of GDP now is capital expenditure on AI, which is huge. And he's making the comparison here to railroads in the eighteen hundreds, which were around 6% of GDP, and telecommunications, which was around 1% of GDP in 2000. AI data centers are now up to 2%.
Bart:The exact percentages are a bit hard to pin down. What he's basing the analysis on is roughly: what is the total amount of NVIDIA's sales, what percentage is that of total AI related sales, and then applying an economic multiplier to it. Because if you have these sales, if you have these data centers, you also have upstream things happening, fabs being created, and downstream, workers doing things. It's more than just the cost of the goods.
Bart:Estimates are that we're now at 2% of GDP in The US. And GDP, most people know it, but it's a metric that says: this is how big our economy is. You can see it a bit like that. And 2% of the size of The US economy is now going to AI data centers. I guess it is what it is. Yeah.
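As a rough back-of-the-envelope check of that 2% figure (the GDP number below is our assumption for illustration, not from the episode):

```python
# Back-of-the-envelope: what "2% of GDP" means in dollars.
us_gdp_trillions = 29.0   # assumed 2025 US GDP, illustration only
ai_capex_share = 0.02     # the roughly-2% estimate discussed here

ai_capex_billions = us_gdp_trillions * ai_capex_share * 1000
print(f"~${ai_capex_billions:.0f} billion per year")  # ~$580 billion per year
```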
Bart:What you see with these things is that it's unprecedented. Right? There are only a few moments in history where you see this kind of big shift in GDP, because 2% of GDP is big. Right?
Murilo:Yeah. I also saw a comment somewhere, I think from China, that they were trying to rein it in a bit. Because even in China, every Chinese province is now trying to build an AI data center, but do they all actually need one? Right?
Murilo:Like, everyone's moving in the same direction, but it doesn't quite make sense, quote, unquote. And I think the other thing they mentioned is that an AI data center is something you have to maintain. There are also long-running costs. Right?
Murilo:You have to upgrade hardware. You have to make sure that everything is in a good state. It's not just a one-off.
Bart:Well, to me, that is not necessarily a difference with the other examples he gives, like railroads or telecommunications. The difference is that this type of hardware is typically short-lived. Right? If you compare it to telecommunications or railroad infrastructure, the lifetime is way shorter, which also raises questions about the long-term GDP impact. But it is an interesting trend, and I think it confirms what everybody feels.
Bart:AI is becoming this very big thing, and we don't know exactly what to expect or what it will mean for job markets. But at the same time, there's already so much investment going into it. And there are some discussions that this might also have negative side effects: where before you would see funding from these big companies like Google, Meta, OpenAI, xAI going into venture capital, basically supporting new startups, now they invest in their own infrastructure. Because it's such a big cost, you have this shift of capital flow that might have a longer-term effect on the economy. There are also people worried that this is very much a hype, and what do you do when the hype dies down?
Bart:But we also have the other side. You saw this a bit with telecommunications in the dot-com bubble, where there was a lot of investment in telecommunications, and then the dot-com bubble happened and a lot of these companies basically died. But it meant that you had this infrastructure still in place that you could buy almost for free, and actually a lot of new startups got value out of it. So even if it is hype, it's maybe not a loss from a value point of view.
Murilo:I see. But do you think that could also happen with the data centers, given the short lifetime of the hardware?
Bart:Yeah. That is the challenge here. Right?
Murilo:Yeah. Well, I agree. In a way it's not surprising, but on the other hand, a lot of points were raised here: the comparisons, the hardware lifetime, and the fact that people are moving in that direction, but maybe without coordination. Right?
Murilo:Like, do we all need to do these things? So I don't know. It's an interesting problem that I hadn't thought about, to be honest.
Bart:Yeah. To me it's a bit of a confirmation, an economic confirmation. What we all see happening, AI becoming a very big thing in the future world, you also see playing out in the economic metrics.
Murilo:Indeed. Indeed. Yeah. What else do we have, Bart?
Bart:We have a cloud law scholar questioning whether new sovereign cloud offerings from AWS, Microsoft, and Google genuinely shield European data from NSA reach. They bluntly stated that, and I quote, unfortunately, none of the three hyperscalers explain how exactly their new measures reduce the risk of US government access, leaving customers to connect the dots. Interesting article I saw passing by on LinkedIn. It's from Dave Michels, a professor at... thought you were gonna show it. Sorry.
Bart:A professor in The UK somewhere. And this is a discussion we've been having over the last years. Sorry, a professor at Queen Mary University of London.
Murilo:There we go.
Bart:This discussion on sovereign clouds is one we've been having over the last years, and it has intensified with all the kerfuffle between, let's say, not necessarily between the US and Europe, but between the US and the rest of the world. The reality today is that the big cloud providers are US providers, and in Europe we don't really have serious contenders. We have some contenders, but for most purposes they are not full contenders.
Murilo:There were also announcements from, so even though they are American, or American-born, let's say, I think some American clouds saw the shift and are trying to set up subsidiaries to cater to the European public.
Bart:So what all three of them did is try to offer basically a sovereign offering in Europe. Either they already created a European entity or they are in the process of creating one: an entity governed by someone in Europe, located in Europe, but that still offers their cloud services. The article goes into the regulations in play in The US: if someone in The US, an employee of a US entity, has remote access to these European entities, he would still be under the obligation to share the data if there is a warrant for it. Right? Mhmm.
Bart:It's a warrant under FISA Section 702 that states this should be possible, and it also has reach outside of The US. The problem is that just having a European entity doesn't really solve anything if you still have US employees with potential access to it, which will probably be the case. Right? Because development, most of it, is probably in The US.
Bart:So what they do now is offer these EKMs, external key management, so customers have their own encryption keys. But for some purposes, these cloud providers will ask: can we get your encryption key? Because we need to index your data, for example, to make it easier to search. So there are moments where you share this encryption key.
Bart:And at those moments, that's where the article drills down: again, a US employee would have to share this encryption key without informing the user, if it comes to that. Again, there needs to be a warrant, so that's in place. But the European entity doesn't really solve any of this. Only full encryption does.
Murilo:I see. So it's a solution that is there, but it doesn't solve the whole problem.
Bart:I think it's the best solution there is, but at the end of the day, you're still using services from a US provider. That's the reality of it. And if shit hits the fan, it still gets to you. Right?
Murilo:Yeah. But there are also European clouds. Right? To be clear, I never used them, but the feeling I get from talking to people who have is that they're not as mature.
Murilo:It's not as widespread, some things are not as smooth. Right? But I think that will be
Bart:You're talking about the European providers, like Scaleway?
Murilo:Exactly. Yeah. So I guess that would really solve the problem, from what I understand, but the experience is not the same.
Bart:It's the experience is not the same, and the offering is probably not as complete.
Murilo:Yeah. Yeah. But did did like, and when you say not as complete, does it still cover 80%, or does it cover the 40%? Like, how No. I think I think it
Bart:would still I think we're it should reach 80%.
Murilo:Okay. So then it's more like if people are motivated to do it and are okay with living with the inconveniences, it is a viable solution.
Bart:It is a viable solution, but it's it's harder. Like, for a lot of reasons, it's it's harder to do, but especially for large corporations, large enterprises, it's harder to find the right skills.
Murilo:Yeah. That's true. It's much
Bart:easier to find someone with experience on AWS than someone with with extensive experience on OVH. Right? And that's not insurmountable, but it's these are realities.
Murilo:Growing pains.
Bart:And the communities around these three major providers, which are US, are huge, even in the EU.
Murilo:Yeah. No. That's yeah. That's definitely true. That's definitely true.
Bart:So interesting one. Not sure what the what the answer to this is.
Murilo:Yeah. Me neither. Are you optimistic about, I mean, I think this became a talking point now because of the US geopolitical situation. Right? Like, it was a big signal: maybe we shouldn't rely on them as much.
Murilo:But for Microsoft and AWS, it's in their interest to solve, quote unquote, these issues. Are you optimistic that in the future we'll have a good solution? And by the future, I mean by the end of the year.
Bart:I don't know. I think it also comes down to the willingness of US regulators to allow the big three US cloud providers to provide truly sovereign services. It really comes down to the US government at this point. And the other question, will we ever have a good European contender, is also difficult. Europe is a very fragmented market. We've seen EU subsidy projects in the past.
Bart:We've seen Gaia-X, which was basically the idea to create this European cloud offering. The problem with these European subsidy projects, in my eyes, is often that it's a bit too politically correct. It's not: here, we're going to create this entity wherever it's best geographically, because the skills are there, the knowledge is there, the legal framework is there, it's going to be that country, and we're going to invest, I don't know, 2 billion in that company to grow a cloud provider. No. It's: we want to be competitive in the cloud provider space.
Bart:Is there anyone, a government or any company in Europe, that wants to build this? And then you get a list of 2,000 companies, subsidies spread across this whole space, and in the end there is not a lot to show for it.
Murilo:You have enough food to feed them and keep them alive, but not to really grow strong.
Bart:Yeah. So unless we see something like a very clear investment in a dedicated entity that needs to become the cloud provider, maybe one of the existing ones, I think it's going to be hard for Europe to be a competitor in this space. Alright.
Murilo:Shall we move to the next topic?
Bart:Yes.
Murilo:So this is Galileo's open source agent leaderboard that pits major LLMs against realistic enterprise scenarios, exposing sharp gaps between tool usage and actual task completion. The July update shows that, and I quote, ChatGPT 4, sorry, not ChatGPT, just GPT-4.1 leads with an average action completion score of 62 across all domains, while rivals like Gemini 2.5 Flash excel only in narrow metrics. So as I understand, and correct me if I'm wrong, Bart, Galileo is a company for observability, monitoring, all these things, and they created a leaderboard. This is version two, and I'm showing the Git repo here for people who are watching.
Bart:two was released this weekend. That's why.
Murilo:It was this week. So it's very new.
Bart:Very new.
Murilo:So version one was just tool calling, and this version two also has, and I quote again, enhanced framework with synthetic dataset generation, a simulation engine for evaluation, task completion, and tool quality performance. So it's to evaluate agents, basically.
Bart:Yeah. And a bit of the premise, it's also in the GitHub README, is that they want a leaderboard that is relevant for the agent domain. They give the example of Klarna: they decided to replace 700 customer service reps with AI, and it completely backfired, so now they're rehiring humans.
Bart:They Klarna, they save money, but, like, the compute the customer experience, like, highly degraded. And what you want to do is to understand, like, if I want to replace someone by an agent sounds very nefarious when I put it like that. But if you want to replace a process by an agent, like, how do you know the performance before you flip that switch? Right? And that is what that v two lay leaderboard tries to do.
Bart:Instead of just testing whether an agent can call the right tools, which is more or less what all the agent benchmarks out there do, can this agent call the right tools at the right moment, it doesn't stop there. It also puts the AI through real-time scenarios with multi-turn dialogues and complex decision making, to really understand the performance of the agent in its domain while, along the way, it also calls some tools. Most of these leaderboards mainly focus on the tool calling, which you could argue is not really illustrative of real-life performance.
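A toy sketch of what an action-completion style metric might compute, comparing an agent's logged tool calls against what a scenario expected. The scoring rule and scenario names are invented for illustration; this is not Galileo's actual formula:

```python
def action_completion(expected: list[str], performed: list[str]) -> float:
    # Fraction of expected actions the agent actually performed,
    # ignoring order. Invented scoring rule, for illustration only.
    if not expected:
        return 1.0
    hits = sum(1 for action in expected if action in performed)
    return hits / len(expected)

# Synthetic scenario: the agent should look up the invoice and then
# send it; it did the lookup but emailed support instead.
score = action_completion(
    expected=["lookup_invoice", "send_invoice"],
    performed=["lookup_invoice", "email_support"],
)
print(score)  # 0.5
```

A real benchmark layers multi-turn dialogue and judge models on top of this kind of per-scenario score, but the aggregation idea is the same.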
Murilo:Yeah. I think it's not illustrative, but it is informative. Right?
Bart:It is informative.
Murilo:Yeah. Yeah. Yeah. Indeed. Maybe a quick question.
Murilo:You mentioned Klarna backfired. Do you think it will backfire for other companies as well?
Bart:If you do it too quickly and don't understand what's going on, yes, that definitely will.
Murilo:I think especially for customer facing stuff. Right?
Bart:Yeah. I think customer facing stuff is is it really depends on, like, will this if you change something, if you automate a process by using an agent, will it keep the customer experience same level or will it improve or will it degrade? Right? I think if you don't model this well, if you don't design it well, it's very easy to degrade.
Murilo:And as you were saying this, I was thinking that the things that are measurable, like response time, how many answers you get through, those things are easy to measure, and it's like, yeah, it will be better with an agent. Duh.
Murilo:But then the things that are a bit harder, like how good was the answer, how thoughtful was the answer, how patient. It's not as much in your face, but that's that's probably the core of what's a good answer.
Bart:It's hard to evaluate. Yeah.
Murilo:Yeah. It's it's
Bart:more subjective level to it. Yeah. That I agree with you.
Murilo:Indeed. And maybe that's also why, maybe, I'm not sure, I'm speculating a lot here, but maybe that's why Klarna was so quick to pull the plug at first. But yeah, it was the things that were harder to measure, you know?
Bart:But I also think it's not hard to test. It's hard to define those tests, written out. But as a person, if you want to test it: you want to call Klarna to understand what your next invoice will be. If you call as a person, you need to explain why you're calling, and then you get a link, and then you need to click it.
Bart:And versus if you talk take say the same thing to an agent, and you say you gotta immediately get the the next invoice estimation back, they're gonna probably gonna say, ah, agent's better.
Murilo:Yeah. Indeed.
Bart:If it were the reverse, the same. It's not hard to test, as a person, what the experience is.
Murilo:Yes.
Bart:But I also think it's hard to do at scale, because it's very subjective, and to understand what everybody might ask.
Murilo:but I don't think it's undoable. No. I agree. I don't think it's undoable, but I also think there's a big pool of questions or scenarios that the the what's good or bad is very subjective and or maybe it's not as easy to I do feel like there's there's a group of tasks that is not as easy to evaluate. Maybe that is very specific.
Murilo:Like, you want an invoice estimation, fine. But maybe someone is having an issue, and it's a stupid issue, and they need to debug with a person; maybe the answers are not going to be very helpful then.
Bart:But that I agree with. Like like, things where your agent is not built to have a easy answer, you should immediately forward to a human. Because otherwise, you're you're as a customer, you're gonna get very frustrated.
Murilo:Indeed. But sometimes I feel like people call because they have a problem, but they don't even know what the problem is. And if you don't even know what the problem is, then how can you, you know? You really need someone with the more human, quote unquote, qualities: being patient, being understanding, asking the right questions. And for those things, I can see how, if you don't prepare the agents well, if there isn't a really good plan for how these things go,
Murilo:It can go downhill. Right? Yeah. So oh, yeah. You run this this leaderboard as well?
Murilo:Anything insightful that you got from it that you didn't see in other places? Anything like, oh wow, now I'm going to start using Gemini 2.5 Flash because it's really good at this or that?
Bart:No. I was surprised to see GPT-4.1 topping the chart, with GPT-4.1 mini actually just below it, and below that Sonnet 4.
Murilo:So that's Yeah. That's true. That's
Bart:true. Yeah. But to me, these are these leaderboards are very interesting from the moment that you're building an agent, that you're building a tool based system to quickly also have an understanding, like, what is the latest and greatest because it probably changed from a month ago. Right? Yeah.
Bart:That's why these these leaderboards are very valuable.
Murilo:But to be honest, like, I think in the beginning when I was playing more with LLMs, it's like, which one should I start for this project? And it's like, then you really go, like, okay. Maybe, like, some research, and let's look at leaderboards. But now I feel like I'm at a point that, like, man, just pick whatever, and then probably gonna be fine. You know?
Murilo:Just pick just pick one.
Bart:Yeah. Just pick one of the the the current state of the art ones. Right?
Murilo:Exactly. Yeah. Yeah. So it's like, I think in the beginning, I was looking more at these leaderboards, but now I feel like I'm a bit it's like, just just pick one. It's fine.
Murilo:Like, don't don't waste so much brainpower. Because one, it may change tomorrow, and two, sometimes the difference between them, like, when you like, it's not that different. Like, for I mean, it is different, and I do know this. I prefer some models over others.
Bart:But I would say it's very different when it comes to tool calling. If you depend on tool calling in your chain and the performance is not good, your whole chain can very quickly degrade.
Murilo:That I agree. But maybe I have maybe I just haven't touched as many problems like that. Right? But for example, 4.1 versus, like, Claude. Right?
Murilo:Like, would you look at a leaderboard, or would you just pick Claude?
Bart:Probably, yes, because 4.1 is very expensive. Okay.
Murilo:Okay.
Bart:But you can also make the argument, like, well, we were just discussing Europe. There are still a lot of firms that say, oh, we need to use Mistral because it's the only European one. If you look at this benchmark, Mistral is very low. Is that something, like, do you want to accept this lower performance just to have a European-based model?
Bart:Question mark. Right? Like, you you get into these type of of discussions then. And for that, it's good to see some actual data to to understand, like, how could this be the choice I'm making here.
Murilo:Yeah. Then you at least have an informed discussion. Right? It's not just he said, she said. I feel like it's more like, okay.
Murilo:We can we can get on the same page here.
Bart:And to be honest, like, if I look at myself, when I evaluate models, it's very much like, oh, yeah. I feel this. I feel that. I feel
Murilo:Yeah, indeed. Maybe one thing we can look at another time: I listened to an episode of the Changelog podcast, actually, where they were saying there was a paper showing that even though developers feel more productive with AI, the study showed that people are less productive with AI.
Bart:I actually saw it passing by.
Murilo:Yeah. So something we can maybe discuss or bring up another time, because I do feel like, yeah, "I feel", "he feels". But at the same time, if someone asks me, like, hey, can I have access to this?
Murilo:Because I feel like this. I'm like, come on, man. You feel like, give me something here. You know?
Bart:Yeah.
Murilo:But I get it. I get it. What else do we have?
Bart:Maybe a small correction. I said that 4.1 is much more expensive than Sonnet; that's not true. It's 4.5 that is much more expensive, but that one isn't even on the benchmark.
Murilo:Okay. And GPT-5 is coming out soon, I think. No? I think I read it somewhere.
Bart:A lot of rumors, more than in the last months, but we hear rumors about GPT-5 every week, I guess.
Murilo:That's how they stay relevant. Just whispers around. Shall we move on?
Bart:Yes. So AWS again.
Murilo:Sorry. Real quick. I saw where I read this: it was actually in the previous article that we covered, about OpenAI and the Math Olympiad. It says there, by the way, we're releasing GPT-5 soon, and we're excited to try it.
Murilo:So that's where I got that information.
Bart:Let's see.
Murilo:Sorry. You were saying?
Bart:So in other news, AWS is previewing S3 Vectors, a new bucket type that stores embeddings directly in S3 and promises big savings on vector search workloads. As AWS puts it, Amazon S3 Vectors is the first cloud object store with native support to store large vector datasets and provide subsecond query performance, cutting costs by up to 90%. Interesting topic.
Murilo:Yes.
Bart:So Amazon S3, probably most people have heard of it. It's used a lot to store files in buckets, with a lot of cool utility functions around that: to search your files, to get files, to get temporary links to files. And you can optimize your storage there, cold storage, hot storage, to optimize your cost. So Amazon S3, very famous.
Bart:Now they have S3 Vectors, and you can look at it a bit like a new bucket type specifically for vector embeddings. Vector embeddings, people in this space probably know, are high-dimensional representations of a document, or an image, or whatever; most of the time it's documents, the easiest to reason about. And you can now store these in S3 vector buckets. What you will often do is have, say, the vector embedding for PDF A in the vector bucket, but the actual PDF still in another bucket, so you can make the link between the two.
Bart:They promise very fast and very cheap query performance, and that the storage costs are also very low, which is what S3 is typically known for, because you basically only pay when you're streaming data in, getting data out, or querying it. You can choose your own embedding model, so you can say, okay, I'm going to transform this data into an embedding and store it. But via Amazon Bedrock, you can also use a utility that embeds it for you.
Bart:So it's cool to see. Typically, vector databases are more or less, quote unquote, standard databases that are optimized to also search and index embeddings. Think, for example, in the simplest case, of a Postgres database with a vector extension where you can store embeddings. And here, probably not exactly like this under the hood, an embedding becomes just another file you drop in your bucket, and Amazon makes it very easy to query and search these things.
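Setting the actual AWS API aside (the service is still in preview), the core operation S3 Vectors provides, store embeddings under keys and query by similarity, can be sketched in plain Python. The bucket contents and key names below are made up:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    # Cosine similarity, the standard metric for embedding search.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# "Bucket" of embeddings keyed like objects; in S3 Vectors the keys
# would point back to the source documents in a regular bucket.
bucket = {
    "doc-a.pdf": [0.9, 0.1, 0.0],
    "doc-b.pdf": [0.0, 0.2, 0.9],
}

def query(vector: list[float], k: int = 1) -> list[str]:
    # Rank stored embeddings by similarity to the query vector.
    ranked = sorted(bucket, key=lambda key: cosine(bucket[key], vector),
                    reverse=True)
    return ranked[:k]

print(query([0.8, 0.2, 0.1]))  # ['doc-a.pdf']
```

The real service adds durable object storage, indexing, and sub-second queries at scale; the point here is only the query model.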
Murilo:Yeah. I think it's interesting. Well, I think it's nice, I'm excited about this. Let's see about performance and all these things, and whether you can query things nicely.
Murilo:But it's something where, yeah, before you had to put stuff in a database, and then you could just drop stuff in a bucket and easily find it. Now you can do that with vectors, which I think is a natural progression, but I hadn't thought of it yet, actually. So very cool. I also think the Snowflakes and the others are going to start doing something like this as well. As I understand, there's a whole family of platforms that separate compute and storage. Right?
Murilo:To make basically things cheaper, because like you said, storage is cheap. I hadn't seen that with vectors because I think to search vectors as well, it's not as straightforward. But so this seems like the the first step there. So it's still in preview. Curious to I'm very curious to see the first use case on this.
Murilo:Like, people saying, like, ah, we did this, and we saved so much money, and this is really good, and this is that. So I think it's really, really nice. Yeah. Have you played a lot with the vector databases, Bart?
Bart:Not that much, to be honest. No.
Murilo:No. Yeah. Me neither, actually. But I would have imagined that I would have played more by now, to be honest, like, if you're playing with all the AI things. But you know?
Bart:Yeah. It's very typical for the RAG space, of course, when you're making RAG-based solutions. Yep.
Murilo:But aside from that, I don't know. I mean, I understand the premise and why it's useful, but I just imagined that if you're doing more stuff with AI, you'd really touch vector databases sooner or later.
Bart:Well, my job's changed a bit over the last years, of course, but what also changed is the context length of these models. When I first started doing this a few years ago, the context length was so small that you almost had to build these vector stores even for, let's say, a large PDF of 200 pages; you'd need to build a small, maybe in-memory, vector store for it. But now the context length has become so big that it's not really an issue anymore. Right?
Bart:So it's maybe also plays into this.
Murilo:Yeah. That's true. We worked around it, right, in a lot of ways. Cool. And shall we move on to our last topic?
Bart:Yes. Let's do that.
Murilo:So this is about CLIs. Developer Ryan argues that command line tools and APIs need redesigning so LLM agents can navigate them without context-window thrashing or endless loops. He opens with the plain reminder that, and I quote, we need to augment our command line tools and design APIs so they can be better used by LLM agents. Then he shares fixes like modernized docstrings and Git wrappers. This is an article I came across, and it resonated with a lot of the stuff that I've encountered playing with Claude Code.
Murilo:One of the things he mentions, for example: a lot of the time when he's using Claude Code, you see that it uses head to see the top of a file, but limits it to the first 100 lines. And that's because the context windows of these models are limited. Claude Code learned to do it; it knows it can reduce its context usage like this. But for a lot of these tools, the first 100 lines are going to be useless.
Murilo:It's not going to help you; the top isn't necessarily what you're looking for. So he talks about ways to rethink this: maybe give a summarized view and let the model do follow-up queries after that, instead of dumping everything in one place, because that isn't going to help the models. Another thing he mentions: to keep the code clean on his project, he has a lot of pre-commit hooks, not necessarily using the pre-commit package. And this is something I've definitely seen, the loop: the model makes a change, then builds the project, okay.
Murilo:Then it runs the tests and they fail, and it tries to fix the tests and can't. And then, instead of fixing the tests, it just says, okay, let's skip the commit hooks, just --no-verify, which is something that happened to me a few times too. And what I did in Claude, because you can memorize some instructions, right?
Murilo:I just said, never skip this. Even though there's not a, how do you say, a strong guarantee that it won't skip, right? It's just in the system prompt. So what he actually did is wrap the git command.
Murilo:So every time there's a git commit --no-verify, it would actually fail. And it gives more context to the agent, saying, hey, you tried to skip the commit hooks; redo this, but don't skip them. So, things like that. He also went on; I thought this was funny.
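A minimal sketch of such a git wrapper, assuming it sits on the PATH ahead of the real git binary; the function name and error messages here are our own illustration, not the article's actual code:

```shell
#!/usr/bin/env bash
# Intercept attempts to bypass commit hooks and return an explanatory
# message the agent can act on, instead of silently allowing the skip.

block_no_verify() {
  if [ "$1" = "commit" ]; then
    shift
    for arg in "$@"; do
      if [ "$arg" = "--no-verify" ] || [ "$arg" = "-n" ]; then
        echo "error: you tried to skip the commit hooks." >&2
        echo "Redo the commit without --no-verify and fix the hook failures instead." >&2
        return 1
      fi
    done
  fi
  return 0
}

# A real wrapper would then delegate: block_no_verify "$@" && exec /usr/bin/git "$@"
block_no_verify commit --no-verify && echo "allowed" || echo "blocked"   # → blocked
block_no_verify commit -m "fix tests" && echo "allowed" || echo "blocked" # → allowed
```

The agent sees a failure with a reason instead of a successful hook-skipping commit, which is exactly the extra context the article argues for.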
Murilo:After that, he said it kind of worked, but the model then started trying to change the pre-commit hooks, to delete them. So instead of trying to fix the code to pass the pre-commit hooks, the model was trying to edit .git/hooks/pre-commit to bypass these things. Basically: I won't have pre-commit hook issues if there are no pre-commit hooks. So he actually had to change the edit settings for these models as well. And then he talked more about the information architecture.
Murilo:So basically, it's not just the user experience but how things are designed, and how we need to rethink that a bit. Another example he gives is cargo build | head -n 100. He was working with Rust, and when Claude was building the project, instead of reading all the build logs it would just take the first 100 lines, which a lot of the time is not useful. A lot of the time the errors didn't appear in the first 100 lines. And then his solution here was to kind of cache what happened.
Murilo:Because the build command is also very expensive, right? What he was explaining is that the model would actually try to run it a few times to understand what the issue was, and every time it ran, it would be very expensive. So he suggests that something like caching the logs and just displaying them again would already be helpful.
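A rough sketch of that caching idea, assuming the agent runs builds through a wrapper; the cache layout and key scheme are our own illustration (the article doesn't prescribe one):

```shell
#!/usr/bin/env bash
# Cache the output of an expensive build command so that repeated
# invocations by an agent replay the stored log instead of rebuilding.

cached_run() {
  local cache_dir=".build-log-cache"
  mkdir -p "$cache_dir"
  # Key the cache on the exact command line.
  local key
  key=$(printf '%s' "$*" | cksum | cut -d ' ' -f 1)
  local log="$cache_dir/$key.log"

  if [ -f "$log" ]; then
    cat "$log"               # replay the cached log, no rebuild
  else
    "$@" 2>&1 | tee "$log"   # run once and store the output
  fi
}

cached_run echo "pretend this is a slow cargo build"  # runs the command
cached_run echo "pretend this is a slow cargo build"  # replays the cached log
```

A real version would also invalidate the cache when source files change, for example by folding mtimes or content hashes into the key.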
Murilo:And then he talks about some other things as well. Sometimes the model would try to run commands in the wrong directory, and then it would basically try the command in every directory. So he has some small wrappers that say, this is the current directory, and maybe you want to move to this or that directory, etcetera, etcetera. A lot of these things are very relatable to me using Claude Code as well. I think a lot of these problems are also problems I recognize.
Murilo:Like, a lot of the times when I'm trying to extract text from a page, I notice that it gets truncated every time, even when I was just trying to inspect it. And I think it's because a lot of the time the CLI wasn't made for these things; it was made for humans. And one of the things he even suggests at the very end is to have CLIs for LLMs.
Murilo:Well, sorry, not CLIs, but terminals for LLMs. Not sure I'd go that far. Yeah. But I do feel like there's a lot of room for improvement in all these things.
Murilo:Right. Have you experienced anything like this before, Bart?
Bart:Well, I went through the article, and I was a little bit confused about the actual point it's making, because it's about rethinking CLI interfaces, where I think there is an argument you can make, but a lot of the things he touches on are more about his AI assistant tooling workflow not working for him. Within that workflow, tool usage that is not being done correctly by the model. Like, if your model always does a head -n 100 to only show the first 100 lines, I mean, that's wrong tool usage. Right?
Murilo:Yeah. True. But I think a lot of times in the CLI, the model just uses Bash for everything, right? Like, for example, for Claude, I haven't actually attached MCP servers yet
Bart:because So so so if you look, for example, Cloud Code, what Cloud Code does, in my eyes, very well, it's what makes its difference with a lot of the other ones is it uses RipGrep, which is a regex tool to find text and to really line in a line oriented way. When you see a call for RG, it's it's RibGrab. And what I do believe is that if you have these command line interfaces like RipGrap that that are valuable as a tool, I think it's it is very valuable to also have a strong documentation on that tool for an LM to correctly operate it.
Murilo:Yeah. But it's not just documentation, because, so I agree, but I also know that a lot of this documentation, like if you run something --help or man git, you get a lot, a lot of stuff, right? Very detailed. I mean, I've definitely experienced this as a user, right? You have to really scroll to find the right thing.
Murilo:And sometimes I feel like for the LLM, I can see how that could be a problem.
Bart:If the documentation is too much, you mean?
Murilo:If the documentation is too much, or it depends on the way it's structured. Because even if you grep it, sometimes you may not find exactly what you're looking for. You may find fragments of what you're looking for, right?
Bart:Yeah. I think ideally you want it to either be condensed enough to fit in the system prompt for your AI-assisted coding session, or it was in the training data for that LLM.
Murilo:Well, my first thought was you probably want it to be condensed and then have, for example, if you're looking for this class of things, you can run this help x or y, and it'll give you more information on that. Kind of like a tree structure, right? Like, this is the very general view.
Murilo:And if you want this, go there, go there, go there. I would imagine that would help LLMs with these things.
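To make that concrete, here's a toy sketch of what tree-structured help could look like: a condensed overview that points the agent to narrower follow-up queries. The tool name, topics, and flags are all made up for illustration:

```shell
#!/usr/bin/env bash
# Condensed top-level help plus drill-down topics, so an agent can
# fetch only the slice of documentation it actually needs.

show_help() {
  case "$1" in
    "")
      cat <<'EOF'
mytool - example CLI (condensed help)
Topics (run "mytool help <topic>" for details):
  build    compile the project
  test     run the test suite
  deploy   push a release
EOF
      ;;
    build)
      echo "mytool build [--release] [--target DIR]  -- compile the project"
      ;;
    *)
      echo "unknown topic: $1 (try: build, test, deploy)" >&2
      return 1
      ;;
  esac
}

show_help          # condensed overview, cheap on context
show_help build    # follow-up query for one topic
```

The overview stays small enough for a context window, and each follow-up is a targeted query rather than a dump of the full man page.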
Bart:Yeah, I'm not sure. Like, if you just use the help command of a CLI, if a human can understand it, an LLM that has good performance on tool usage should probably be able to understand it as well.
Murilo:Then would you expose it as a tool, really? Because I think a
Bart:It's just the bash command, right? Like a command in your terminal. Yeah.
Murilo:True. But wait, what do you mean by just a command on your terminal? Like, just like a user would?
Bart:Just like a user would, yeah. Like a git command. Your Claude Code can call git commit, for example, or it could call rg.
Bart:It can call cat on a file and show the first 100 lines, but it's just calling tools in your terminal. Right?
Murilo:True, true. But then the documentation on these tools, that's what you're saying. Well, I haven't actually seen Claude call the help command that much.
Bart:Oh, I haven't seen it either. No. No. That's true. I agree.
Murilo:So maybe that's another thing. Yeah. Maybe that's true.
Bart:It only makes sense when you are in a project where you are very specific about, use this tool for that reason; then it becomes relevant to do these things. Well, I haven't been in that situation. Yeah. Because typically I use, for whatever language, the build commands, which are very standard, or a git command or something.
Bart:But I think you see Claude doing this with rg, which is, like, their solution to correctly find lines of code in files. So I think they either instructed or even fine-tuned the model to
Murilo:I see.
Bart:To work very well with rg.
Murilo:And I guess if it's instructed, then you just need to be clear in your instructions, right? But you don't have to worry as much about the documentation necessarily.
Bart:And I think a lot of the other things in the article are really about workflow, what works and what doesn't, like the whole whack-a-mole thing. Everybody has probably been in that situation. It's also about how free you let the model go, because if you say, okay, just accept everything and I'll check in half an hour, then it probably did a git commit --no-verify.
Bart:But if you say, okay, code edits are fine, but bash commands I want to verify before you run them, you can define this threshold of control that you have. Right?
Murilo:Yeah. Indeed. And I definitely do the auto-approve everything. And then: okay, not what I wanted.
Murilo:Maybe one last thing to close this up as well. One possible solution that Claude Code actually has is hooks. I think we talked about this before, or no?
Bart:Not on the show yet.
Murilo:Not on the show yet. But basically, the way I think of it at least is kind of like pre-commit hooks: you have code that runs at different steps of your git flow, like right before you commit and all these things. And with Claude's hooks, you can add the same kind of thing there. Right?
Murilo:So you could say, run all the pre-commit hooks right after the tool is done. There's PreToolUse, PostToolUse, and I think Stop or SubagentStop or something. But basically, you can say, always run these hooks after the agent is done doing something, right? And in that case, I think it's one way you can tackle a lot of these problems, because the model is not going to bypass it; it cannot.
Murilo:It's basically hard-coded there. So, yeah, these can be very, very valuable. Yeah.
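For reference, a hook like the one described is configured in Claude Code's settings file (e.g. `.claude/settings.json`). This is a sketch based on our reading of the documented schema at the time of the episode; event names and fields may change, so check the current docs, and the pre-commit command here is just an example:

```json
{
  "hooks": {
    "PostToolUse": [
      {
        "matcher": "Edit|Write",
        "hooks": [
          {
            "type": "command",
            "command": "pre-commit run --all-files"
          }
        ]
      }
    ]
  }
}
```

Because the hook runs outside the model's control, it can't be skipped the way a system-prompt instruction can.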
Murilo:Yeah, I definitely want to. I looked into it a bit, but in the end I was like, okay, I'll just put it in the memorized instructions, like in the system prompt, kind of.
Murilo:But I think it's something I'll start using more, now that I think about it.
Bart:It allows you to lower your threshold on what you want to verify, because you can have hooks that say, after making this change, make sure to run the tests, or, I want to make sure this dependency is never used in my project. Like, you can add these types of hooks. Right? Yeah.
Bart:You could be a bit more opinionated about the actual output without verifying every step.
Murilo:Which I think is good, right? I mean, yeah, if you're opinionated, when you do look at the code, hopefully you're going to see something you recognize more easily, and that's going to improve the overall experience for you. Very cool.
Murilo:And I think that was it. Anything else you want to say on this? Okay, that's it then. Like we said, we'll keep everyone posted.
Murilo:We may change things a bit. We'll see how it all goes.
Bart:Yeah. But we've got quite some ideas. So we have the holiday season coming up, and we have this weekly session. We have some ideas on inviting guests.
Bart:Actually, I already have quite an interesting lineup there. We also have some ideas on, let's say, key-insight explainers on topics. Like, we had the discussion earlier on agent-to-agent. Maybe it would be interesting to share, like, a three-minute video on what agent-to-agent is, what you can use it for, how it compares to MCP. So we have a lot of these things that we're working on.
Bart:We hope to make that a bit more concrete a few weeks from now, so that everybody knows a bit what they can expect for the second half of this year.
Murilo:Exactly. Yeah. Indeed. So stay tuned. If anyone has any comments, questions, or suggestions as well.
Murilo:Feel free to leave a comment on the YouTube channel or send us a message or anything. We'd definitely appreciate it. And while you're at it, if you want to leave a review.
Bart:Yeah. And thanks for all the the love on our social channels. Yes. Very much appreciated. And if you have not yet, then please subscribe.
Murilo:Exactly. Alright. Thank you, Bart. Thanks, everyone.
Bart:See you next time. Thank you, Milo. Bye.