The witting or unwitting use of synthetic data to train generative models departs from standard AI
training practice in one important respect: repeating this process for generation after generation of
models forms an autophagous (“self-consuming”) loop. As Figure 3 details, different autophagous
loop variations arise depending on how existing real and synthetic data are combined into future
training sets. Additional variations arise depending on how the synthetic data is generated. For
instance, practitioners or algorithms will often introduce a sampling bias by manually “cherry-picking”
synthesized data to trade off perceptual quality (i.e., the images/texts “look/sound good”) vs. diversity
(i.e., many different “types” of images/texts are generated). The informal concepts of quality and
diversity are closely related to the statistical metrics of precision and recall, respectively [39]. If
synthetic data, biased or not, is already in our training datasets today, then autophagous loops are all
but inevitable in the future.
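As a concrete illustration of the sampling bias that cherry-picking introduces, consider the following minimal sketch (Python with NumPy; the names generate_synthetic, cherry_pick, quality_fn, and keep_fraction are hypothetical and not part of any pipeline described here). Retaining only the highest-scoring synthetic samples raises quality (precision) while shrinking the spread, and hence the diversity (recall), of what is kept.

import numpy as np

rng = np.random.default_rng(0)

def generate_synthetic(n, mean=0.0, std=1.0):
    # Stand-in generative model: 1-D Gaussian samples.
    return rng.normal(mean, std, size=n)

def cherry_pick(samples, quality_fn, keep_fraction=0.2):
    # Keep only the top-scoring fraction of samples; this is the
    # sampling bias introduced by manual or automatic curation.
    scores = quality_fn(samples)
    k = max(1, int(keep_fraction * len(samples)))
    return samples[np.argsort(scores)[-k:]]

# Hypothetical quality score: closeness to the main mode at 0.
quality = lambda x: -np.abs(x)

raw = generate_synthetic(10_000)
picked = cherry_pick(raw, quality)

# Quality (precision) rises, but the spread of the kept samples,
# a crude proxy for diversity (recall), collapses.
print("std before cherry-picking:", round(float(raw.std()), 3))
print("std after  cherry-picking:", round(float(picked.std()), 3))

In practice the quality score might come from human raters or an automatic perceptual metric, and the same bias arises when generation itself is skewed (e.g., truncated or low-temperature sampling); the sketch only isolates the precision-for-recall trade described above.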