What is model collapse and why should we care about it?
Model collapse is what happens when an AI model is trained on data that was itself generated by AI models. The concern is that each generation of training subtly warps the data distribution — rare but real patterns get squeezed out, and synthetic artifacts get amplified — until the model's outputs become increasingly narrow and distorted.
This matters because the internet is rapidly filling with AI-generated text, images, and code. If future models are trained on this content without careful filtering, they could inherit a kind of generational degradation, like a photocopy of a photocopy. The outputs might look fluent but quietly lose the diversity and accuracy that made earlier models useful.
Researchers are actively working on data provenance tools and filtering pipelines to detect and exclude synthetic content from training sets, but it's an open and urgent problem as the line between human and AI-generated content continues to blur.