It’s infuriatingly hard to understand how closed models train on their input
One of the most common concerns I see about large language models regards their training data. People are worried that anything they say to ChatGPT could be memorized by it and spat out to other users. People are concerned that anything they store in a private repository on GitHub might be used as training data for future versions of Copilot.
When someone asked Google Bard how it was trained back in March, it told them its training data included internal Gmail data! This turned out to be a complete fabrication (a hallucination by the model itself), and Google issued firm denials, but it’s easy to see why that freaked people out.