The Second Half (ysymyth.github.io)
thetrustworthy 5 hours ago [-]
For those who are knowledgeable about the field but not yet familiar with the author of this post, it is worth mentioning that Shunyu Yao has played a huge role in the development of LLM-based AI agents, including being an author of / contributor to:

- ReAct

- Reflexion

- SWE-bench

- OpenAI Deep Research

- OpenAI Operator

armchairhacker 4 hours ago [-]
RL doesn't completely "work" yet; it still has a scalability problem. Claude can write a small project, but as the project becomes larger, Claude gets confused and starts making mistakes.

I used to think the problem was that models can't learn over time like humans, but maybe that can be worked around. Today's models have context windows large enough to fit a medium-sized project's complete code and documentation, and tomorrow's may be larger; good-enough world knowledge can be maintained by re-training every few months. The real problem is that even models with large context windows struggle with complexity more so than humans do; they miss crucial details, then become very confused when trying to correct their mistakes and/or miss other crucial details (whereas humans sometimes miss crucial details, but are usually able to spot them and fix them without breaking something else).

Reliability is another issue, but I think it's related to scalability: an LLM that cannot make reliable inferences from a small input cannot grow it into a larger output without introducing cascading hallucinations.

EDIT: creative control is also downstream of reliability and scalability. You could generate any image imaginable with a sufficiently reliable diffusion model: first generate something vague, then repeatedly refine it (specifying which details to change and which to keep), each refinement getting closer to what you're imagining. Except even GPT-4o isn't nearly reliable enough for this technique; while it can handle a couple of refinements, it too starts losing details (changing unrelated things).
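
Sketching the loop being described, with placeholder functions standing in for whatever image model and feedback source you have (generate, refine, and get_feedback are made up for this example, not real API calls):

    def refine_until_satisfied(prompt, generate, refine, get_feedback, max_rounds=10):
        # Start with something vague, then repeatedly ask for targeted edits,
        # each round specifying what to change and what to keep unchanged.
        image = generate(prompt)
        for _ in range(max_rounds):
            feedback = get_feedback(image)   # e.g. {"change": [...], "keep": [...]}
            if not feedback["change"]:       # nothing left to fix
                break
            image = refine(image, change=feedback["change"], keep=feedback["keep"])
        return image

The whole loop only works if each refinement step leaves the "keep" parts untouched, which is exactly the reliability gap being pointed out.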

dceddia 3 hours ago [-]
I wonder how much of this is that code is less explicit than written language in some ways.

With English, the meaning of a sentence is mostly self-contained. The words have inherent meaning, and if they’re not enough on their own, usually the surrounding sentences give enough context to infer the meaning.

Usually you don’t have to go looking back 4 chapters or look in another book to figure out the implications of the words you’re reading. When you DO need to do that (maybe reading a research paper for instance), the connected knowledge is all at the same level of abstraction.

But with code, despite it being very explicit at the token level, the “meaning” is all over the map, and depends a lot on the unwritten mental models the person was envisioning when they wrote it. Function names might be incorrect in subtle or not-so-subtle ways, and side effects and order of execution in one area could affect something in a whole other part of the system (not to mention across the network, but that seems like a separate case to worry about). There are implicit assumptions about timing and such. I don’t know how we’d represent all this other than having extensive and accurate comments everywhere, or maybe some kind of execution graph, but it seems like an important challenge to tackle if we want LLMs to get better at reasoning about larger code bases.

fullstackchris 20 minutes ago [-]
This is super insightful, and I think at least part of what you're describing already exists: an abstract syntax tree! Or at the very least, one could include metadata about the token under scrutiny (similar to how most editors can show you git blame / number of references / number of tests passing for the code you are looking at...)

It makes me think about things like... "what if we also provided not just the source code, but the abstract syntax tree or dependency graph", or at least the related nodes relevant to what code the LLM wants to change. In this way, you potentially have the true "full" context of the code, across all files / packages / whatever.
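
As a rough illustration of the idea, here is a sketch using Python's built-in ast module to pull out a structural outline (definitions and the calls they make) that could be handed to the model alongside the raw source; the file name and the exact summary format are made up for this example:

    import ast

    def outline(source):
        """Compact structural summary of a Python file: which functions/classes
        are defined, and which names each of them calls."""
        tree = ast.parse(source)
        lines = []
        for node in ast.walk(tree):
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
                calls = sorted({
                    c.func.id
                    for c in ast.walk(node)
                    if isinstance(c, ast.Call) and isinstance(c.func, ast.Name)
                })
                lines.append(f"{type(node).__name__} {node.name} -> calls: {calls}")
        return lines

    # The outline could be prepended to the prompt as structured context,
    # alongside (or instead of) the raw source the LLM is asked to modify.
    print("\n".join(outline(open("example.py").read())))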

debone 2 hours ago [-]
Not really true.

You can have a book where in the last chapter you have a phrase "She was not his kid."

Knowing nothing else, you can only infer the self-contained details. But in the context of the book, this could be the phrase that turns everything upside down, and it can hinge on a lot of earlier context.

wavemode 6 hours ago [-]
> AI has beat world champions at chess and Go, surpassed most humans on SAT and bar exams, and reached gold medal level on IOI and IMO. But the world hasn’t changed much, at least judged by economics and GDP.

> I call this the utility problem, and deem it the most important problem for AI.

> Perhaps we will solve the utility problem pretty soon, perhaps not. Either way, the root cause of this problem might be deceptively simple: our evaluation setups are different from real-world setups in many basic ways.

LLMs are reaching the same stage that most exciting technologies reach. They have quickly attracted lots of investor money, but that is going to have to start turning into actual money. Many research papers are being written, but people are going to start wanting to see actual improvements, not just theoretical improvements on benchmarks.

PaulHoule 6 hours ago [-]
I can think of some ways LLMs perform better in real life than they do in evals.

For instance, I ask AI assistants a lot about what some code is trying to do in application software where it is a matter of React, CSS, and how APIs get used. Frequently this is pattern matching that doesn't require deep thought, and I find LLMs often nail it.

When it comes to "what does this systems-oriented code do", now you are looking at halting-problem kinds of problems, or cases where a person will be hypnotized by an almost-bubble-sort into thinking it's a bubble sort, and the LLM is too. You can certainly make code-understanding benchmarks aimed at "whiteboard interview" kinds of code that are arbitrarily complex, but that doesn't reflect the ability or inability to deal with "what is up with this API?"
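
A made-up example of the kind of "almost-bubble-sort" that can hypnotize both readers and models: it keeps the familiar compare-and-swap shape but has no outer loop, so a single pass only floats the largest element toward the end.

    def almost_bubble_sort(xs):
        # Looks like bubble sort at a glance, but there is no outer loop,
        # so the list is NOT actually sorted.
        xs = list(xs)
        for i in range(len(xs) - 1):
            if xs[i] > xs[i + 1]:
                xs[i], xs[i + 1] = xs[i + 1], xs[i]
        return xs

    print(almost_bubble_sort([3, 1, 4, 1, 5, 9, 2, 6]))  # [1, 3, 1, 4, 5, 2, 6, 9] -- not sorted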

animuchan 6 hours ago [-]
I think what you're describing is that easy tasks are easy to perform.

Which is, of course, true. Anecdotally, a lot of value I get from Copilot is in simple, mundane tasks.

PaulHoule 6 hours ago [-]
I think easy tasks are basically "linear" in that you don't have interactions between components. If you do have interactions between components, complexity gets out of control very quickly. Many practical problems, for instance, are NP-complete or undecidable. Many of them could be attacked by SMT or SAT solvers, but often you can solve them using tactics from math.
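
For illustration, a minimal example of the SMT route mentioned above, using z3's Python bindings on a toy system of coupled constraints (the constraints themselves are made up):

    from z3 import Int, Solver, sat

    x, y = Int("x"), Int("y")
    s = Solver()
    # A tiny system of interacting constraints -- the kind of coupling between
    # components that makes these problems blow up as they grow.
    s.add(x + 2 * y == 7, x > 0, y > 0, x * y < 10)

    if s.check() == sat:
        print(s.model())   # e.g. [y = 2, x = 3]
    else:
        print("unsatisfiable")
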
stapedium 6 hours ago [-]
Current AI is like search. You still have to know the vocabulary and the right questions to ask. You also need the ability to differentiate a novel answer from a hallucination. It's not going to replace lawyers or doctors any time soon.
mplanchard 6 hours ago [-]
Meta request to authors: please define your acronyms at least once!

Even in scientific domains where a high level of background knowledge is expected, it is standard practice to define each acronym prior to its use in the rest of the paper, for example “using three-letter acronyms (TLAs) without first defining them is a hindrance to readability.”

nathell 6 hours ago [-]
Alessandra Sierra has a great piece on this:

https://www.lambdasierra.com/2023/abbreviating/

a1ff00 5 hours ago [-]
Couldn’t agree more. Had a hell of a time working out how they were using RL after its first use, but gave up in frustration when the remainder of the text was just more undefined symbols/acronyms.
philipwhiuk 4 hours ago [-]
On this note - what's "i.i.d."?
Bernard_sha_256 3 hours ago [-]
I believe this is referring to the probability concept of "independent and identically distributed".

In its usage on the page, it seems to refer to how tasks/problems are run in parallel and the learning averaged, whereas the author is advocating running these problems sequentially.

https://en.wikipedia.org/wiki/Independent_and_identically_di...
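
Roughly, the distinction being drawn might look like this sketch (the agent methods are made up for illustration, not anything from the post):

    import random

    # i.i.d. evaluation: each episode is drawn independently from the same
    # task distribution, and nothing carries over between episodes.
    def iid_eval(agent, task_pool, n=100):
        return [agent.solve(random.choice(task_pool)) for _ in range(n)]

    # Sequential evaluation: tasks arrive in order and share state, so what
    # the agent did (or learned) earlier affects every later task.
    def sequential_eval(agent, tasks):
        state, results = {}, []
        for task in tasks:
            result, state = agent.solve_with_state(task, state)
            results.append(result)
        return results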

GiorgioG 1 hours ago [-]
More AI hype from an AI "expert". AI in software development is still a junior developer that memorized "everything" and can learn nothing beyond that; it will happily lie to you, and it will never tell you the most important thing a developer can be comfortable saying: "I don't know".
jarbus 6 hours ago [-]
I largely agree, and this is actually something I've been thinking about for a while. The problem was never the algorithm; it's the game the algorithm is trying to solve. It's not clear to me to what extent we can push this beyond math and coding. Robotics should be ripe for this, though.
daveguy 31 minutes ago [-]
Unfortunately the feedback loop for robotics is many, many orders of magnitude slower than for math / coding problems. And when you move to artificial environments, you are learning artificial dynamics -- the same limitations as the benchmarks.
fullstackchris 6 minutes ago [-]
Moravec's paradox
cadamsdotcom 5 hours ago [-]
Benchmark saturation will keep happening.

Which is great! There's room in the world for new benchmarks that test for more diverse things!

It's highly likely at least one of the new benchmarks will eventually test for all the criteria being mentioned.

nottorp 5 hours ago [-]
Is it me or are they proposing making LLMs play text adventures?
ma_s79miskl6 55 minutes ago [-]
he was going to poke some kids
yapyap 5 hours ago [-]
> Instead of just asking, “Can we train a model to solve X?”, we’re asking, “What should we be training AI to do, and how do we measure real progress?”

To say we are at a point where AI can do anything reliably is laughable; it can do a lot, but it will give you an answer, right or wrong, with full confidence. To trust such a technology with the big no-human-in-the-loop decisions, like we want to, is a fool's errand.

m0llusk 6 hours ago [-]
Um, what is RL?
animuchan 6 hours ago [-]
Rocket Launcher?

Please, let it be Rocket Launcher for once.

zeigotaro 6 hours ago [-]
Reinforcement Learning
zomglings 6 hours ago [-]
Reinforcement Learning.

I hate acronyms with a fierce passion.

coolThingsFirst 6 hours ago [-]
While I do agree that acronyms can be a PITA, AFAIK RL seems to truly lead to AGI. ICBA to provide more detail.
zomglings 6 hours ago [-]
My blood pressure just tripled.