AI agents wrong ~70% of time: Carnegie Mellon study

Jaden Norman · 8 months ago

@[email protected] · 8 months ago

Run something with a 70% failure rate 10x and you get to a cumulative 98% pass rate. LLMs don’t get tired and they can be run in parallel.

@[email protected] · 8 months ago

I have actually been doing this lately: iteratively prompting AI to write software and fix its errors until something useful comes out. It’s a lot like machine translation. I speak fluent C++ but not Rust, yet I can hammer away on the AI (with English-language prompts) until it produces passable Rust for something I could write myself in C++ with half the time and effort.

I also don’t speak Finnish, but Google Translate can take what I say in English and put it into at least somewhat comprehensible Finnish without egregious translation errors most of the time.

Is this useful? When C++ is getting banned for “security concerns” and Rust is the required language, it’s at least a little helpful.

@[email protected] · 8 months ago

I’m impressed you can make strides with Rust with AI. I am in a similar boat, except I’ve found LLMs are terrible with Rust.

@[email protected] · 8 months ago

I was 0/6 on various trials of AI for Rust over the past 6 months, then I caught a success. Turns out, I was asking it to use a difficult library. I can’t make the thing I want work in that library either (library docs say it’s possible, but…). When I posed a more open-ended request without specifying the library to use, it succeeded, after a fashion. It will give you code with cargo build errors; I copy-paste the error back to it like “address: <pasted error message>” and a bit more than half of the time it is able to respond with a working fix.
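The copy-paste loop described above can be sketched in a few lines. This is a minimal illustration, not the commenter’s actual setup: `ask_model` is a hypothetical callable (assumed to send the prompt to whatever model is in use and apply its suggested fix to the project), and the round limit is an arbitrary assumption.

```python
import subprocess
from typing import Callable, Optional


def cargo_build_error(project_dir: str) -> Optional[str]:
    """Run `cargo build`; return the stderr text on failure, None on success."""
    result = subprocess.run(
        ["cargo", "build"], cwd=project_dir, capture_output=True, text=True
    )
    return None if result.returncode == 0 else result.stderr


def repair_loop(
    check: Callable[[], Optional[str]],
    ask_model: Callable[[str], None],  # hypothetical: prompts the model, applies its fix
    max_rounds: int = 5,               # arbitrary cap, an assumption
) -> bool:
    """Feed each build error back to the model until the build passes or rounds run out."""
    for _ in range(max_rounds):
        error = check()
        if error is None:
            return True
        # Same prompt shape as in the comment: "address: <pasted error message>"
        ask_model(f"address: {error}")
    return check() is None
```

With a real project this might be called as `repair_loop(lambda: cargo_build_error("."), ask_model)`. Passing the build check in as a callable keeps the loop testable without cargo installed.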

@[email protected] · edit-2 3 months ago

deleted by creator

@[email protected] · 8 months ago

i think rust actually is quite well suited to agentic development workflows, it just needs to mature more.

I agree. The agents also need to mature more to handle multi-level structures - work on a collection of smaller modules to get a larger system with more functionality. I can see the path forward for those tools, but the ones I have access to definitely aren’t there yet.

@[email protected] · 8 months ago

The problem is they are not i.i.d., so this doesn’t really work. It works a bit, which is in my opinion why chain-of-thought is effective (it gives the LLM a chance to posit a couple answers first). However, we’re already looking at “agents,” so they’re probably already doing chain-of-thought.
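To make the i.i.d. point concrete, here is a toy calculation. It assumes a made-up mixture model that is not from the study: some fraction of tasks (`p_hard`) defeat the model on every attempt, while the rest succeed independently on each try. Both scenarios below have the same 30% per-attempt success rate, but retries help far less when failures are correlated.

```python
def p_any_success_mixture(p_hard: float, p_success_easy: float, attempts: int) -> float:
    """Chance that at least one attempt succeeds under the toy mixture model.

    p_hard: fraction of tasks that fail on every attempt (assumed, illustrative).
    p_success_easy: per-attempt success rate on the remaining tasks.
    """
    return (1.0 - p_hard) * (1.0 - (1.0 - p_success_easy) ** attempts)


# Independent attempts: every task has a 30% chance per try.
iid = p_any_success_mixture(p_hard=0.0, p_success_easy=0.3, attempts=10)

# Correlated attempts: half the tasks never succeed, the other half succeed
# 60% of the time, so the average per-try success rate is still 30%.
correlated = p_any_success_mixture(p_hard=0.5, p_success_easy=0.6, attempts=10)

print(f"independent: {iid:.4f}")         # 0.9718
print(f"correlated:  {correlated:.4f}")  # 0.4999
```

The 50/50 split is chosen only to keep the arithmetic easy; real agent failures will not follow this model exactly.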

@[email protected] · 8 months ago

Very fair comment. In my experience, even when you increase the temperature you get stuck in local minima.

I was just trying to illustrate how 70% failure rates can still be useful.

Log in | Sign up · 8 months ago

What’s 0.7^10?

@[email protected] · 8 months ago

About 0.02

Log in | Sign up · 8 months ago

So the chances of it being right ten times in a row are 2%.

@[email protected] · edit-2 8 months ago

No, the chances of being wrong 10x in a row are 2%. So the chances of being right at least once are 98%.
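The arithmetic is quick to check. The sketch below assumes the ten attempts are independent, each failing 70% of the time; the function name is only for illustration. The exact figures come out a little under the rounded ones quoted in the thread.

```python
def p_at_least_one_success(p_fail: float, attempts: int) -> float:
    """Probability that at least one of `attempts` independent tries succeeds."""
    return 1.0 - p_fail ** attempts


p_all_fail = 0.7 ** 10
print(f"all 10 fail:       {p_all_fail:.4f}")                         # 0.0282
print(f"at least one pass: {p_at_least_one_success(0.7, 10):.4f}")   # 0.9718
```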

Log in | Sign up · 8 months ago

Ah, my bad, you’re right. For being consistently correct, I should have done 0.3^10 = 0.0000059049.

So the chances of it being right ten times in a row are less than one thousandth of a percent.

No wonder I couldn’t get it to summarise my list of data right and it was always lying by the 7th row.
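A rough sketch of how quickly a streak decays, assuming every step is independent and right 30% of the time (the headline figure; real per-row accuracy on a summarisation task is unknown here):

```python
def p_all_correct(p_correct: float, steps: int) -> float:
    """Probability that `steps` independent steps are all correct."""
    return p_correct ** steps


for steps in (1, 3, 7, 10):
    print(f"{steps:>2} in a row: {p_all_correct(0.3, steps):.10f}")
```

Under these assumptions, the chance of getting seven rows in a row right is about 0.02%.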

@[email protected] · 8 months ago

That looks better. Even with a fair coin, 10 heads in a row is almost impossible.

And if you are feeding the output back into a new instance of a model then the quality is highly likely to degrade.

Log in | Sign up · 8 months ago

Whereas if you ask a human to do the same thing ten times, the probability that they get all ten right is astronomically higher than 0.0000059049.

@[email protected] · 8 months ago

Dunno. Ask 10 humans at random to do a task and probably one will do it better than AI. Just not as fast.

@[email protected] · edit-2 3 months ago

deleted by creator