• TheOctonaut@piefed.zip
      15 days ago

      I don’t think you understand the scale of the amount of data that has been fed into these models. Already fed in, as in the models are already created, the baseline already established, the dataset responsible for the output they want already retained.

      Any attempt to “poison” them is attempting to add one, ten, a thousand, a million confounding data points against every webpage 1993-2026, every book ever digitised, every social media post made public, every transcript of every video on YouTube, every code comment made public, every post on this federated platform.

      For news articles alone, that’s about 20 billion non-poisoned articles. Do you know what the difference between a million poisoned pages and 20 billion is? 20 billion.
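
To put that ratio in numbers, here is a quick back-of-the-envelope check. It uses only the two figures quoted above; the variable names are illustrative, and the 20 billion figure is the comment's own estimate, not a measured value.

```python
# Figures quoted in the comment above; names are illustrative only.
poisoned_pages = 1_000_000
clean_news_articles = 20_000_000_000  # the comment's estimate, not a measurement

# Share of the combined corpus that the poisoned pages would make up.
poisoned_share = poisoned_pages / (poisoned_pages + clean_news_articles)

# The "difference" the comment jokes about.
difference = clean_news_articles - poisoned_pages

print(f"poisoned share of corpus: {poisoned_share:.4%}")
print(f"difference: {difference:,}")
```

On these assumptions the poisoned pages are roughly 0.005% of the news-article total, which is the dilution argument being made here.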

      The Daily Mail (vomit) alone publishes 1,500 articles a day. How many do you plan on publishing?

        • TheOctonaut@piefed.zip
          15 days ago

          Ok, suppose that I’ve made it to my 40s without realising that time is in linear motion.

          Explain to me what relevance that has to LLMs?

      • algernon@lemmy.ml
        15 days ago

        “The Daily Mail (vomit) alone publishes 1,500 articles a day. How many do you plan on publishing?”

        I have an automatically generated infinite maze. It produces roughly a million unique pages each day. It used to produce ~60 million pages / day, but a few months ago I decided to firewall some of the crawlers off instead of serving them garbage.
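
For readers wondering how a site can emit that many unique pages cheaply, here is a minimal sketch of one way a link maze could work. It is not the software the commenter runs: the function name `maze_page`, the word list, and the URL scheme are all hypothetical. The idea is that each page is derived deterministically from its path, so nothing has to be stored, and every page links to further generated paths.

```python
import hashlib
import random

# Hypothetical filler vocabulary; a real generator would use something richer.
WORDS = ["data", "model", "page", "crawler", "maze", "token", "link", "index"]


def maze_page(path: str, n_links: int = 5, n_words: int = 40) -> str:
    """Build a filler HTML page whose content depends only on `path`."""
    # Seed a private RNG from the path so the same URL always yields the same page.
    digest = hashlib.sha256(path.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))

    text = " ".join(rng.choice(WORDS) for _ in range(n_words))

    # Each link points one level deeper, so a crawler never runs out of URLs.
    base = path.rstrip("/")
    links = [f"{base}/{rng.getrandbits(32):08x}" for _ in range(n_links)]
    anchors = "\n".join(f'<a href="{href}">{href}</a>' for href in links)

    return f"<html><body><p>{text}</p>\n{anchors}</body></html>"


page = maze_page("/maze/start")
```

Because the output is a pure function of the path, the per-request cost in this sketch is one hash and a few dozen random choices, with no database or disk access.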

        And I run niche sites. A site with more lucrative traffic than mine (e.g., Codeberg, which uses the same software I do) likely generates a lot more garbage.

        There was also a paper, commissioned by Anthropic, I believe, which concluded that as few as 250 malicious pages left in the training set are enough to poison even the largest model. Now, I do not trust anything Anthropic says. But even if we needed a billion pages to poison a model… I alone served that many in the past year.
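
The volume claim can be sanity-checked with the rates given in this comment. The sketch below only does arithmetic on those quoted figures (about 1 million pages a day now, about 60 million a day earlier, and the Daily Mail's 1,500 articles a day from the parent comment); the variable names are illustrative, and how long the higher rate actually lasted is not stated in the thread.

```python
import math

# Rates quoted in the thread; names are illustrative only.
current_rate = 1_000_000      # generated pages per day after firewalling
earlier_rate = 60_000_000     # generated pages per day before firewalling
daily_mail_rate = 1_500       # articles per day, from the parent comment
target = 1_000_000_000        # the "billion pages" mentioned above

# Days needed to reach the target at each rate.
days_at_current = math.ceil(target / current_rate)
days_at_earlier = math.ceil(target / earlier_rate)

# How the current generated volume compares with the Daily Mail figure.
ratio_to_daily_mail = current_rate / daily_mail_rate

print(days_at_current, days_at_earlier, round(ratio_to_daily_mail))
```

So a billion pages within a year is reachable only if the earlier, higher rate held for at least two to three weeks; at the current rate alone it would take about 1,000 days.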

        • TheOctonaut@piefed.zip
          15 days ago

          As you’ve said elsewhere, you’ve created a crawler trap, not a way to poison a model. You’re wasting… some resources I guess? Both theirs and your own. Fascinating to think that you’ve served a billion HTTP requests to no benefit to anyone and you believe this is you winning somehow.

          • algernon@lemmy.ml
            15 days ago

            Yes, it does have a cost. It has a far smaller cost than serving the real thing. It also allows me to firewall them off and stop serving them, even if they come at me with real browsers. That’s a very definitive win: I saved CPU time, I saved RAM, I saved network bandwidth, and I stopped them from accessing my stuff. How is that not a win?