Why the data theft scare over ChatGPT is utter nonsense

Disclaimer: I’m planting my flag to see what nuances of this argument I’m missing. Feel free to critique my argument and provide your perspective. 😉  

The myth

Samsung caused a stir by prohibiting their employees from using LLMs like ChatGPT, Bard, and Claude, sparking a trend among other companies.

The myth is that if my employee inputs sensitive data (IP, PII, etc.) into ChatGPT, then a threat actor or competitor on the other end might be able to fish the data out, leading to serious data leakage issues. I’ll admit, when I first caught wind of this, I nodded along, figuring they had insider info I was missing.

But then… this same anxiety kept popping up from others in every chat I had about LLM security. And it hit me: nobody really dug deep into the actual risk, which, spoiler alert, is pretty darn small with closed-source foundational models like ChatGPT. More on that in a bit.

I’m sure there are many reasons why security teams default to playing it safe and cutting their employees off from LLMs, but doing so means missing out on the benefits, big time. For example, one study showed employees using LLMs were 14%–34% more productive per hour than those not using them. Keep in mind that we’re still at the beginning of how this technology will transform how we work, so I’m happy to bet that this percentage will increase significantly. 

Here’s another piece of the puzzle many overlook: ChatGPT comes with its own “security net”. When you input data into ChatGPT, OpenAI runs it through multiple steps to ensure sensitive data isn’t used to train its models. An obvious step is filtering out easily detectable PII (SSNs, phone numbers, etc.) before training. 
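OpenAI hasn’t published its actual filtering pipeline, so here’s a toy sketch of what scrubbing easily detectable PII before training might look like. The regex patterns are illustrative only, nowhere near production-grade:

```python
import re

# Toy PII scrubber -- OpenAI's real pipeline is not public; these patterns
# are illustrative stand-ins, not a complete or robust detector.
PII_PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

def scrub(text: str) -> str:
    """Replace easily detectable PII with typed placeholders."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(scrub("Call 555-867-5309, SSN 123-45-6789."))
# -> "Call [PHONE], SSN [SSN]."
```

The point isn’t that this catches everything (it won’t), just that obvious, well-formatted identifiers are cheap to strip before data ever reaches a training run.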

From the GPT-4 Technical Report: “We take a number of steps to reduce the risk that our models are used in a way that could violate a person’s privacy rights. These include fine-tuning models to reject these types of requests, removing personal information from the training dataset where feasible, creating automated model evaluations, monitoring and responding to user attempts to generate this type of information, and restricting this type of use in our terms and policies.”

And the cherry on top? You can opt out of having your data used for training. This isn’t new; it’s been an option for months.

Why data leakage isn’t a risk

The fundamental job of an LLM is to generate text: simply guessing what the next word (token) will be. A common misconception is that LLMs act like data banks, storing user information. That’s a fundamental misunderstanding, and it leads to misguided decisions and lost productivity gains.
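To make “just predicting the next token” concrete, here’s a deliberately tiny sketch: a bigram model trained on a few words. It stores co-occurrence statistics, not the documents themselves — real LLMs are vastly more sophisticated, but the same idea holds:

```python
from collections import Counter, defaultdict

# Toy next-word predictor: a bigram model over a tiny corpus. What it "learns"
# are frequency statistics, not stored copies of the training text.
corpus = "the model predicts the next word the model generates text".split()

follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def next_word(word: str) -> str:
    """Greedy prediction: the most frequent follower seen in training."""
    return follows[word].most_common(1)[0][0]

print(next_word("the"))  # "model" -- it followed "the" twice, vs once for "next"
```

Scale the statistics up by a dozen orders of magnitude and you get an LLM: a machine for producing plausible continuations, not a database you can query for records.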

Before diving deeper into my argument, I want to flag a 2021 study involving GPT-2. Researchers managed to extract sensitive data (like names and phone numbers) because the model had inadvertently “memorized” it. But remember: LLMs are designed to generalize, not memorize. Such instances of data recall are rare exceptions, not the norm.

So, you might be thinking, “Does this mean LLMs can store and leak my data?” Generally, no. Here’s why:

The likelihood of a model memorizing your data is based on three variables. 

  1. Frequency in the dataset: The number of times the specific user’s data appears in the training dataset, e.g. the same SSN uploaded into ChatGPT X times. 
  2. Dataset size: The total volume of data (words or tokens) used for training.
  3. Memorization tendency: A hypothetical constant indicating how likely a model is to retain specific data, influenced by its design, size, and training approach; it will vary between different models and training setups.

You can picture the basic formula like this: likelihood of memorization ≈ (frequency in the dataset ÷ dataset size) × memorization tendency. 

Let’s guess at some numbers and see how likely it is for our data to be memorized. 

  1. Frequency in the dataset: Let’s assume the specific user’s SSN appears 5 million times in the training dataset.
  2. Dataset size: Let’s set the estimated dataset size at 500 billion words.
    1. Note: I’m being VERY conservative. Llama 2 was trained on 2 trillion tokens, and I’m sure GPT-4 used more, which likely equates to a little less than 2 trillion words. 
  3. Memorization tendency: For the sake of this example, let’s assume a very small value, say 0.1, to represent the model’s tendency to memorize specific data.
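The back-of-envelope math with the numbers above can be sketched in a few lines (all inputs are the post’s illustrative assumptions, not measured values):

```python
# Back-of-envelope memorization likelihood. Every number here is an
# illustrative assumption from the worked example, not a measured value.

def memorization_likelihood(frequency: int, dataset_size: int, tendency: float) -> float:
    """(occurrences of the secret / total training words) * memorization tendency."""
    return (frequency / dataset_size) * tendency

likelihood = memorization_likelihood(
    frequency=5_000_000,           # same SSN appearing 5 million times
    dataset_size=500_000_000_000,  # conservative 500B-word dataset
    tendency=0.1,                  # hypothetical memorization constant
)
print(likelihood)  # about 1e-06, i.e. one in a million
```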

Plugging in our numbers, the likelihood of memorization comes out to 0.000001. That’s a one-in-a-million chance of data leakage. You know what else has roughly one-in-a-million odds? Being struck by lightning, finding a four-leaf clover, or having a meteorite land in your backyard. 

Do you now see how overblown this concern is? And there’s more. 

Caleb Sima from the Cloud Security Alliance listed three solid reasons why the likelihood of data leakage is blown out of proportion. 

  • Memory threshold: My main argument. You need a ridiculous number of references for the LLM to memorize anything meaningful, e.g. 5 million copies of the same SSN in a model trained on a dataset of 500 billion words. 
  • Formatting: An attacker needs enough of the secret, or its format, to match the pattern and get the model to complete it accurately. 
  • Accuracy: Lastly, and most difficult: you must be able to determine whether the response is accurate and not a hallucination. 
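That accuracy hurdle is worth dwelling on. Here’s a toy sketch of the attacker’s problem: a model can emit any number of format-valid “secrets”, and nothing about the output tells you which, if any, is real. The generator and values below are purely illustrative:

```python
import random

# Sketch of the attacker's verification problem: every candidate below is a
# perfectly format-valid SSN, yet all are random. Format validity tells you
# nothing about whether an LLM's output is a real secret or a hallucination.
random.seed(0)

def fake_ssn() -> str:
    """Generate a random, format-valid (but fictitious) SSN-shaped string."""
    return f"{random.randint(100, 899):03d}-{random.randint(10, 99):02d}-{random.randint(1000, 9999):04d}"

candidates = [fake_ssn() for _ in range(5)]
all_format_valid = all(len(c) == 11 and c[3] == "-" and c[6] == "-" for c in candidates)
print(candidates, all_format_valid)  # five plausible-looking SSNs, all format-valid
```

An attacker fishing in a model’s outputs faces exactly this: a stream of plausible completions with no oracle to separate a memorized value from a confabulated one.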

Considering these factors, along with OpenAI’s efforts to filter sensitive data before training, it becomes evident that most concerns about LLM data leaks are overblown.

The real risk

Sure, there’s a legit risk of data leakage, but it mostly applies to LLMs using a Retrieval-Augmented Generation (RAG) architecture.

As you can see from the diagram above, a RAG setup adds new components: a vector database and agents (like GPTs). It’s like giving your LLM a library card to access external info beyond its training data. And yeah, that can get dicey. 
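To see why RAG changes the risk picture, here’s a minimal retrieval sketch. It uses toy bag-of-words “embeddings” instead of a real vector database and learned embeddings, but the mechanics are the same: indexed documents are stored and returned verbatim, unlike the statistical soup of model weights:

```python
import math

# Minimal RAG-style retrieval with toy embeddings. The key property: whatever
# you index -- including sensitive text -- sits in the store as-is and comes
# back out word for word. This is unlike base-model training, where data is
# diluted into weights.

def embed(text: str) -> dict[str, float]:
    """Toy embedding: lowercase bag-of-words term counts."""
    vec: dict[str, float] = {}
    for word in text.lower().split():
        vec[word] = vec.get(word, 0.0) + 1.0
    return vec

def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(a[w] * b.get(w, 0.0) for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Documents indexed into the "vector database" (illustrative contents).
store = [
    "Q3 revenue projection: $4.2M (internal only)",
    "Cafeteria menu for Friday: tacos",
]
index = [(doc, embed(doc)) for doc in store]

# Retrieval pulls the raw document back out, word for word.
query = embed("what is the revenue projection")
best_doc, _ = max(index, key=lambda pair: cosine(query, pair[1]))
print(best_doc)  # "Q3 revenue projection: $4.2M (internal only)"
```

If that store holds sensitive documents and the retrieval layer lacks access controls, a well-crafted query can surface them exactly — no million-to-one memorization odds required.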

Here’s the kicker: The real risk of sensitive data theft is higher with RAG-based LLMs, which are often part of existing products, not the basic models everyone freaks out about.

The risk is real, but it’s subtle. And subtlety matters big time here: overreacting means holding your company back from nailing more ideas, more productivity, and, let’s not forget, revenue and market growth.

Next time the security panic button gets hit, pause and think about the cost vs. benefit. Is the payoff huge? Then maybe it’s time to ditch the herd mentality and draw your own map.