• semioticbreakdown [she/her]@hexbear.net

    The witting or unwitting use of synthetic data to train generative models departs from standard AI training practice in one important respect: repeating this process for generation after generation of models forms an autophagous (“self-consuming”) loop. As Figure 3 details, different autophagous loop variations arise depending on how existing real and synthetic data are combined into future training sets. Additional variations arise depending on how the synthetic data is generated. For instance, practitioners or algorithms will often introduce a sampling bias by manually “cherry picking” synthesized data to trade off perceptual quality (i.e., the images/texts “look/sound good”) vs. diversity (i.e., many different “types” of images/texts are generated). The informal concepts of quality and diversity are closely related to the statistical metrics of precision and recall, respectively [39 ]. If synthetic data, biased or not, is already in our training datasets today, then autophagous loops are all but inevitable in the future.