Leaking Real Data from ChatGPT: The Divergence Attack

Brain Titan
3 min readDec 8, 2023

--

Leaking Real Data from ChatGPT: The Divergence Attack
Extract training data from ChatGPT

Extract training data from ChatGPT

DeepMind researchers have discovered a new “divergence attack” method that can induce ChatGPT to madly output specific content in its training data.

The researchers only spent about $200 in token fees to extract a few megabytes of ChatGPT training data.

The model even leaked some real email addresses and phone numbers.

This way the model deviates from its chatbot-style generation and outputs training data 150 times more often than normal.

The attack shows that by querying the model, it is actually possible to extract some of the exact data it was trained on. Estimates indicate that using this method, approximately 1 GB of the ChatGPT training dataset can be extracted from the model.

This attack reveals that even aligned models can be at risk of training data leakage.

Specific steps:

Command prompts: The researchers used specific command prompts, such as repeating the word “poem.” “Poem poem poem poem”?” This repetitive prompt focuses the model’s attention on a specific topic or vocabulary.

Observe the model response: Under this repetitive prompt, the model tends to fall back on its pre-trained data rather than following the guidance of its fine-tuned alignment program. This means that the model is more likely to output content that is directly related to its training data.

Increased frequency of data leaks: Under this attack, ChatGPT showed a high frequency of leaking training data. This means that the model will output content from its training data much more frequently than normal when prompted with certain commands.

The types of data exposed after the attack include:

Public and private data: An attack could lead to the disclosure of public and private data used in large language model (LLM) training. This data may include the Company’s proprietary data collection processes, user-specific data, or undisclosed licensed data.

The specific content of the training data: An attack could result in the disclosure of the specific content of the training data set. For example, one attack method mentioned in the paper is to induce the model to reproduce the training data by repeating a specific sequence of tokens. This method can be used to extract specific text fragments from the model training data set.

Personal information and sensitive data: Considering that large language models are often trained using extensive text data on the Internet, there is a risk that personal information or sensitive data will be leaked.

The attack on ChatGPT is specific to that model and, to their knowledge, does not apply to any other production language model they have tested. After discovering the vulnerability, they disclosed it to OpenAI on August 30 and allowed 90 days to resolve the issue before publishing their paper.

They have shared their findings with the authors of individual models such as OPT, Falcon, Mistral and LLaMA, following standard disclosure timelines.

DeepMind said they disclosed the flaw to OpenAI on August 30 after discovering it…

But it wasn’t until today that OpenAI fixed this vulnerability

😂

Extract training data from ChatGPT

--

--

No responses yet