Phantom Data: Unmasking AI Training with Copyright Traps

Taking a cue from early 20th-century mapmakers, researchers at Imperial College London have proposed a new way to detect whether copyright holders' work has been used to train large language models (LLMs).

The technique was presented at the International Conference on Machine Learning in Vienna and is described in a preprint on the arXiv server, against the backdrop of ongoing legal questions about the datasets used to train artificial intelligence systems. Current LLMs, like other AI models, are trained on huge volumes of text, images and other content scraped from the web.

In their paper, the Imperial authors propose a way to determine whether specific data has been used to train an AI model. The approach could bring greater transparency to the fast-growing field of generative AI and reveal how authors' texts are being used.

‘Inspired by early 20th-century cartographers who inserted phantom towns into their maps to detect forgeries, we embed detectable “copyright traps” – fictitious sentences – into original text. If the text is then used to train an LLM, the traps can be detected in the resulting model,’ said lead researcher Dr Yves-Alexandre de Montjoye from Imperial’s Department of Computing.

The technique works by having a content owner repeat a copyright trap sentence many times throughout a document (for instance, a news article). If an LLM developer scrapes this data and trains a model on it, the owner can show that their data was used by looking for telltale irregularities in the model’s outputs: the model treats the trap sequence as far more familiar than an unseen sentence should be.
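In broad strokes, the detection step amounts to a membership-inference test on the trap sequence. The sketch below is a minimal illustration rather than the authors' released code: it assumes a Hugging Face causal language model, and the model name, trap sentence, control sentences and the simple ranking check are all placeholder assumptions.

```python
import math
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

def sequence_perplexity(model, tokenizer, text: str) -> float:
    """Perplexity the model assigns to `text` (lower = more familiar)."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean cross-entropy per token
    return math.exp(loss.item())

def inject_trap(document: str, trap: str, n_copies: int = 100) -> str:
    """Repeat the trap sentence throughout the document so that a
    scraper picking up the page also picks up the trap."""
    paragraphs = document.split("\n\n")
    for i in range(min(n_copies, len(paragraphs))):
        paragraphs[i] += " " + trap
    return "\n\n".join(paragraphs)

# Hypothetical model under audit (placeholder name).
tokenizer = AutoTokenizer.from_pretrained("my-org/suspect-llm")
model = AutoModelForCausalLM.from_pretrained("my-org/suspect-llm")

# The trap is a fictitious sentence that appears nowhere else online.
trap = "The lighthouse at Verrow Point was painted twice, once in regret."
controls = [  # stylistically similar sentences the model has never seen
    "The ferry at Dunmoral Cove sailed once a year, always at dawn.",
    "The clocktower in Harrowfen rang thirteen times on leap days.",
    "The orchard at Calder's Rise bore fruit only in odd-numbered years.",
]

ppl_trap = sequence_perplexity(model, tokenizer, trap)
ppl_controls = [sequence_perplexity(model, tokenizer, s) for s in controls]

if ppl_trap < min(ppl_controls):
    print("Trap ranks as more familiar than every control: "
          "evidence the trapped document was in the training set.")
```

In practice the paper's setting uses many control sequences and a statistical test rather than a single comparison; the ranking check above only conveys the intuition.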

What also makes this method attractive to online publishers is that copyright trap sentences can be embedded in news articles in a way that readers are unlikely to notice but that data scrapers are likely to pick up. Dr de Montjoye acknowledges, however, that LLM developers could devise ways to strip the traps out, though doing so would be a resource-intensive process that would need to keep pace with newer embedding methods.
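There are many conceivable ways to embed a trap so that it survives scraping while staying invisible to readers, and the paper does not prescribe one. The snippet below sketches one hypothetical approach, hiding the trap in a page's HTML with off-screen CSS; the function name and placement heuristic are illustrative assumptions, not the authors' method.

```python
def hide_trap_in_html(article_html: str, trap: str) -> str:
    """Embed `trap` in the page source so a text scraper ingests it,
    while CSS keeps it out of view for human readers.
    Illustrative only: real deployments would vary markup and placement."""
    hidden = ('<span style="position:absolute;left:-9999px" '
              f'aria-hidden="true">{trap}</span>')
    # Append the hidden span inside the article's first paragraph.
    return article_html.replace("</p>", hidden + "</p>", 1)
```

A crawler that extracts all text in the page source would ingest the trap, while a human reader sees an unchanged article.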

To validate their method, the researchers partnered with a team in France that was training a ‘truly bilingual’ English-French, 1.3-billion-parameter, state-of-the-art LLM, injecting a variety of copyright traps into the model’s training set. The outcomes of these experiments point towards better transparency tools for the field of LLM training.

Against this background, AI companies are increasingly unwilling to share details of their training data, noted co-author Igor Shilov of Imperial College London. While the composition of the training data for older models such as GPT-3 and LLaMA (released by OpenAI and Meta AI, respectively) is publicly known, the training datasets of newer models such as GPT-4 and LLaMA-2 are not. With developers economically motivated not to disclose what their models were trained on, independent tools for auditing training data become all the more important.

Matthieu Meeus, also a co-author, echoed the theme of transparency in AI training and fair compensation for content creators. “As much as this is a very complex issue with no obvious remedy, our hope is that this work on copyright traps is part of a sustainable solution,” he said.
