RL as code

29 Aug 2022

When I see a title with “toward[s]” in it, the first thing I wonder is: why “toward[s]”? Why aren’t we there yet? Scrolling down to “Performance Results”, it’s obvious that the results of this approach are creeping upwards at a snail’s pace. 40% correct programs when the model is given 1000 “guesses” (one thousand) is the best of all the reported results. And it’s the result for the simplest, “intro”-level problems.

From there, results get progressively worse for the harder categories, like harder problem sets or the “1@k” measurement, where only the first program generated by the model is evaluated. For “intro” problems that’s about 7%.

NB: that’s accuracy. Not error. 7% accuracy. 7% of the time the model got the coding problem right on the first try.

Despite the slow creep upwards, this is a really bad result for this kind of approach. The main problem with Large Language Models as code generators is that they can generate code, but they can’t tell whether it satisfies a specification (in this case, given by a combination of natural-language text and input/output examples). The proposed approach (using reinforcement learning to capture, and use, the signal from unit tests) is supposed to address that precise limitation, but the results show that it just doesn’t do very well at all. Not even when the system is allowed 1000 (one thousand) “guesses”.
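
To make the setup concrete: the simplest way to use that unit-test signal is generate-and-filter. Here’s a minimal Python sketch of that loop. sample_program() and examples are hypothetical stand-ins of my own, not the paper’s API; the paper also feeds the test signal back into RL training rather than just filtering after the fact:

    # Generate-and-filter synthesis: sample candidates from a model and
    # keep the first one that passes the problem's I/O examples.
    # sample_program() and `examples` are placeholder names for this sketch.

    def passes_examples(program_src, examples):
        """Run a candidate program against the given (input, output) pairs."""
        scope = {}
        try:
            exec(program_src, scope)  # the candidate must define solve()
            return all(scope["solve"](x) == y for x, y in examples)
        except Exception:
            return False  # a crashing candidate counts as a failure

    def synthesize(sample_program, examples, k=1000):
        """Draw up to k candidates; return the first that passes, else None."""
        for _ in range(k):
            candidate = sample_program()
            if passes_examples(candidate, examples):
                return candidate
        return None  # no correct program among the k "guesses"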

Why not? Because if you start with a code generator that generates almost (but not quite) random code, you end up with almost (but not always) incorrect programs. After that it doesn’t matter how well you “filter” for correct results: you’re unlikely to have any in your generated set. The solution? Generate fewer programs that are more likely to be correct. How to do that with an LLM? Who knows.
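
To put a rough number on that intuition: if each sampled program is independently correct with probability p, the chance that at least one of k samples is correct is 1 - (1 - p)^k. A back-of-the-envelope in Python, with made-up p values for illustration (not numbers from the paper):

    # Chance that a perfect filter finds at least one correct program
    # among k independent samples, for a few per-sample success rates p.
    # The p values are illustrative, not the paper's numbers.

    def yield_at_k(p, k):
        return 1 - (1 - p) ** k

    for p in (1e-2, 1e-4, 1e-6):
        print(f"p={p:g}: chance of a hit in 1000 samples = {yield_at_k(p, 1000):.4f}")
    # p=0.01:  ~1.0000  (filtering can work)
    # p=0.0001: ~0.0952  (usually nothing to find)
    # p=1e-06: ~0.0010  (near-random generation: hopeless)

Once the per-sample success rate is small enough, the filter stops being the bottleneck; the generator is.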

And a small correction to the article:

Recent advances in deep learning, such as pretrained language models (LMs), have led to remarkable progress in program synthesis.

These “advances” in deep learning have led to remarkable progress in neural program synthesis. The approach proposed here, combining a generator with a tester, is certainly not a “remarkable advance” for the broader program synthesis field. Rather, it is a primitive approach that has been tried and refined continuously since the 1970s or so. The program synthesis field has gone a lot further than what’s described in the article. I’ll point to Gulwani et al. again for a recent overview of the field:

Program Synthesis, Sumit Gulwani, Oleksandr Polozov, Rishabh Singh

https://www.semanticscholar.org/paper/Program-Synthesis-Gulw…

Other details noted in the article about “program synthesis” (e.g. the poor performance on complex problems) also apply only to neural program synthesis.

For example, the following paper includes an experiment on learning a program that identifies a palindrome (the hard part of the length-of-a-not-palindrome example in the article), together with other similarly complex programs (i.e. ones that are best expressed recursively):

Learning Higher-Order Programs without Meta-Interpretive Learning, Stanisław J. Purgał, David M. Cerna, Cezary Kaliszyk

https://arxiv.org/abs/2112.14603v1
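
For a sense of scale, the target concept is tiny once you write it recursively. A Python sketch (my own rendering; the paper learns it in a different formalism):

    # Recursive palindrome check: a string is a palindrome if its first
    # and last characters match and what lies between them is itself a
    # palindrome. The base case is a string of length 0 or 1.
    def is_palindrome(s):
        if len(s) <= 1:
            return True
        return s[0] == s[-1] and is_palindrome(s[1:-1])

    assert is_palindrome("racecar")
    assert not is_palindrome("racecars")

The hard part in that setting is not writing this function, but synthesizing it (or its equivalent) from examples.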