Tuesday, May 9, 2023

Poisoning the Well



I almost feel irresponsible for posting this picture above, so a mandatory public service announcement is in order: Do not put bleach in your air vents.

And now for something totally different:

Two types of dataset poisoning attacks that can corrupt AI system results
Mar 2023, phys.org

The researchers began by noting that ownership of URLs on the Internet often expires, including those that have been used as sources by AI systems. That leaves them available for purchase by nefarious types looking to disrupt AI systems. If such URLs are purchased and are then used to create websites with false information, the AI system will add that information to its knowledge bank just as easily as it will true information, and that will lead to the AI system producing less than desirable results.

The research team calls this type of attack split view poisoning. 
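To make that concrete, here is a toy sketch of the situation, not anything from the article itself. The `web` dictionary stands in for the internet, the URL is made up, and the hash check is my own illustration of one possible guard: the curator sees one thing at a URL when building the index, and whoever downloads later sees whatever the new owner decided to serve.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Hex SHA-256 digest of some downloaded bytes."""
    return hashlib.sha256(data).hexdigest()


# Toy stand-in for the web: URL -> bytes currently served there (hypothetical).
url = "http://example.com/cat.jpg"
web = {url: b"a real cat photo"}

# The curator builds an index. One version keeps only the URL,
# the other also records a hash of what was seen at indexing time.
index_url_only = [url]
index_with_hash = [(url, sha256_hex(web[url]))]

# Later the domain lapses and a new owner serves different bytes.
web[url] = b"poisoned content"


def download_url_only(index, web):
    """Trust whatever each URL serves today."""
    return [web[u] for u in index if u in web]


def download_with_hash(index, web):
    """Keep only content whose hash still matches what the curator saw."""
    kept = []
    for u, expected in index:
        data = web.get(u)
        if data is not None and sha256_hex(data) == expected:
            kept.append(data)
    return kept


trusting = download_url_only(index_url_only, web)
checked = download_with_hash(index_with_hash, web)
print(trusting)  # the swapped-in content gets through
print(checked)   # the changed content is dropped
```

The URL-only downloader happily takes the swapped content. The hash-checking one throws it away, which keeps the poison out but also shrinks the dataset.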

There is another way that AI systems could be subverted: by manipulating data in well-known data repositories such as Wikipedia. [This has been a tactic of authoritarian governments since Wikipedia's inception.]

via Google, ETH Zurich, NVIDIA and Robust Intelligence: Nicholas Carlini et al, Poisoning Web-Scale Training Datasets is Practical, arXiv (2023). DOI: 10.48550/arxiv.2302.10149

Post Script:
Another public service announcement: the datasets used by today's deep learning artificial intelligence are often not stored as the data itself. They are stored as lists of URLs, which have to be fetched whenever someone downloads the dataset for training.

In other words, the LAION dataset behind Stable Diffusion is not a bunch of pictures; instead, it's a bunch of URLs of pictures, like a URL with ".jpg" at the end. This is good because it makes the storage needed for 5 billion images much smaller: you're storing the link to the picture, not the actual picture.
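For a rough sense of scale, here is some back-of-the-envelope arithmetic. Only the 5 billion figure comes from this post; the average URL length and average image size are numbers I made up for illustration, so treat the result as an order-of-magnitude sketch.

```python
NUM_IMAGES = 5_000_000_000   # "5 billion images", from the post
AVG_URL_BYTES = 100          # assumption: average URL length in bytes
AVG_IMAGE_BYTES = 100_000    # assumption: roughly 100 KB per image


def total_terabytes(count: int, bytes_each: int) -> float:
    """Total size in terabytes (1 TB = 10**12 bytes)."""
    return count * bytes_each / 1e12


urls_tb = total_terabytes(NUM_IMAGES, AVG_URL_BYTES)
images_tb = total_terabytes(NUM_IMAGES, AVG_IMAGE_BYTES)
print(f"URLs: {urls_tb} TB, images: {images_tb} TB")
# prints "URLs: 0.5 TB, images: 500.0 TB"
```

Under those assumptions the list of links is about a thousand times smaller than the pictures themselves.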

If you've done anything on the internet for more than 5 years, you know what link rot is. It should leave you really confused about how people think the current crop of AI magic will keep working as the URLs rot out, making the dataset smaller and smaller and the quality of the output worse and worse.

(And this isn't even considering the intentional data poisoning attacks described above.)
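Here is a toy model of the link rot worry, assuming a fixed fraction of links dies every year. The 5% rate is purely hypothetical (I have no measured figure), and `surviving_links` is just a name I picked for the sketch.

```python
def surviving_links(total: int, annual_rot_rate: float, years: int) -> int:
    """Links still resolving after `years`, if a fixed fraction dies each year."""
    return round(total * (1 - annual_rot_rate) ** years)


TOTAL_LINKS = 5_000_000_000  # "5 billion images", from the post
ROT_RATE = 0.05              # hypothetical: 5% of links die per year

for years in (0, 1, 5, 10):
    print(years, surviving_links(TOTAL_LINKS, ROT_RATE, years))
```

Plug in whatever rot rate you believe; the shape is the same either way, a dataset that only ever gets smaller.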

Also, double check the thumbnail for this post. It's an ad-poisoning injection about "this one trick" to get the dust out of your air vents by pouring bleach in them. It isn't designed to advertise a product, just to get you to click so the broker can charge both parties for clickthroughs, even though no actual "eyeballs" were delivered. As much as this shouldn't be happening, it is, and it's now in The Big Dataset in the Sky, poisoning our artificial intelligence systems.
