
Synthetic Data: A Potential Time Bomb Under the Internet

Bullshit AI. Habsburg AI. AI Slop. The nicknames are colorful, almost funny, but the concern behind them is real: increasingly distorted, inbred AI systems are about to take over the internet. Human-created data may soon be a luxury, and fact-checking is more essential than ever, warns a researcher.

By Birgitte Svennevig, 12/9/2025

Scroll, scroll, scroll. A heartwarming video pops up: a gorilla in a zoo gently scoops up a lost baby and returns it to its mother. Scroll, scroll, scroll. Now a tiger rescues a baby… maybe a bit unrealistic, and that video is probably fake — but cute enough anyway.

Fast forward five years: a school student is doing an assignment about altruism in the animal kingdom and searches for examples. The Internet’s AI-generated answers tell her that there are plenty of altruistic animals and point her to videos “proving” that animals often help human infants. The student doesn’t stop to think that these animals never existed and that the videos were never filmed. That this is synthetic data, i.e., AI-generated data built from previous AI-generated data, propagating in increasingly inbred, bizarre, and exaggerated AI creations.

- There is already an incredible amount of synthetic data, and AI uses them to generate new data. As internet users, we receive more and more information that was not created by humans, and sometimes not even built on human-created data. There is a risk that we will get a lot of nonsense answers from the internet. In some cases we can spot it, but in other cases we cannot. In the extreme case, if we lose the ability to distinguish, future AI models trained on internet data may not be nearly as good as the models we can build right now. Additionally, the Internet might lose much of what once made it useful, says Anton Danholt Lautrup, who, as a postdoc at the Department of Mathematics and Computer Science, studies synthetic data.

Data about fictitious people

The idea behind synthetic data is actually a good one, but first let’s look at what it is: synthetic data are data created by a generative AI model, meant to mimic real data collected from real-world sources, for example from patients.

- You could call them realistic data about fictitious people, suggests Anton Danholt Lautrup.

Synthetic data may be based on full or partial sets of real patient data, but the data are cleaned of personally identifiable information and subsequently processed by an algorithm so that new datasets are formed. These new datasets are now technically synthetic, and they no longer contain information about identifiable individuals.

3 pieces of advice 

  • Be cautious about AI-generated answers to your Google search. Those answers come from a stochastic parrot, and you cannot be sure that this parrot is trained on human-created data.
  • Be critical of sources. If some data — an image, a video or a piece of text — seems a little too exciting or unusual to be true, investigate whether it can be traced back to a trustworthy source.
  • Think before you auto-complete. Tools like Copilot and other AI writing aids in your text editor can lose linguistic nuance and important academic detail.

Using synthetic data can ease bureaucracy — if, for example, a health researcher wants to collaborate with a third party or publish their data together with a research result.

- Imagine that you have data on 7,000 patients, which you clean and scale up to 50,000 synthetic patients. Now you have seriously large datasets that can be useful in research — and that is a good thing, says Anton Danholt Lautrup.
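
A rough sketch of the principle in the quote above, not Lautrup’s actual method: fit a simple statistical model to a small, de-identified table and then sample a much larger, fully synthetic table from it. The column names and numbers below are invented for illustration, and a multivariate Gaussian stands in for the far richer tabular generators (copulas, GANs, diffusion models) used in practice.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a de-identified real dataset: 7,000 "patients" with
# three numeric columns (invented for illustration).
real = np.column_stack([
    rng.normal(55, 12, 7_000),    # age (years)
    rng.normal(130, 15, 7_000),   # systolic blood pressure (mmHg)
    rng.normal(5.2, 1.0, 7_000),  # cholesterol (mmol/L)
])

# "Generative model": estimate the joint distribution of the real table.
mean = real.mean(axis=0)
cov = np.cov(real, rowvar=False)

# Sample a larger, fully synthetic table: 50,000 fictitious patients.
synthetic = rng.multivariate_normal(mean, cov, size=50_000)

print("real table:     ", real.shape)       # (7000, 3)
print("synthetic table:", synthetic.shape)  # (50000, 3)
print("real column means:     ", real.mean(axis=0).round(1))
print("synthetic column means:", synthetic.mean(axis=0).round(1))
```

Every synthetic row is drawn from the fitted distribution rather than copied from a real person, which is what makes the larger table easier to share.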

But then there is the matter of diversity, which risks being washed out of the synthetic datasets:

- Many AI models have the tendency — if you are not careful — to blur diversity during the process, he says.

Collapsing models

In a sense it’s logical that a “dumb” computer would rather create its synthetic examples close to average values than far from the average — this way the probability of creating something realistic is higher.

- But in real life, you want your dataset to be representative of an entire population, with whatever diversity exists — so that is one of the mechanisms you have to be aware of, says Anton Danholt Lautrup.

Whether synthetic data are used for research on widespread diseases or to create videos of gorillas, tigers or hippos rescuing babies in zoos, the risk of an actual model collapse remains.
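
A small numeric illustration of that blurring, again with invented numbers and a deliberately too-simple model: the real data contain a rare subgroup far from the average, but a single Gaussian fitted to everything hands back far fewer “rare” synthetic examples than the real data actually contain.

```python
import numpy as np

rng = np.random.default_rng(1)

# Real data: 95% of values near 0, plus a rare subgroup (5%) centred at 6.
common = rng.normal(0.0, 1.0, 19_000)
rare = rng.normal(6.0, 1.0, 1_000)
real = np.concatenate([common, rare])

# Too-simple generative model: one Gaussian fitted to the whole dataset.
synthetic = rng.normal(real.mean(), real.std(), real.size)

# How much of the rare subgroup (values above 4) survives?
print("rare fraction, real data:     ", (real > 4).mean())       # about 0.05
print("rare fraction, synthetic data:", (synthetic > 4).mean())  # much smaller
```

The synthetic values cluster near the overall average, so the subgroup that made the real data diverse is largely smoothed away.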

A closed, hallucinating universe

- By model collapse I mean that models trained on synthetic data over several generations will lose their effectiveness, and perhaps along the way cause many undesirable, even harmful side effects. Language models do not know that hippos don’t pick up babies and carry them to a human; they cannot distinguish between real and unreal. And with content on the Internet increasingly generated by AI, these misunderstandings and this erasure of nuance can rapidly enter our post-factual perception of reality, believes Anton Danholt Lautrup.
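
A standard toy illustration of that generational effect, not taken from the thesis: repeatedly fit a simple model to the previous generation’s synthetic output and sample a new dataset from it. With nothing but model output to learn from, the spread of the data tends to drift towards nothing, a numerical version of an increasingly inbred, averaged-out internet.

```python
import numpy as np

rng = np.random.default_rng(2)

# Generation 0: "human" data with mean 0 and standard deviation 1.
data = rng.normal(0.0, 1.0, 100)

for generation in range(1, 2001):
    # Fit the model to whatever the previous generation produced...
    mu, sigma = data.mean(), data.std()
    # ...and let the next generation train only on that synthetic output.
    data = rng.normal(mu, sigma, 100)
    if generation % 500 == 0:
        print(f"generation {generation:4d}: std = {data.std():.4f}")

# In a typical run the standard deviation has shrunk towards zero:
# the model ends up producing near-identical, average-looking samples.
```

Real model collapse in large language models is messier than this one-dimensional toy, but the mechanism is the same: each generation learns from a slightly narrower copy of the last.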

When language models run rampant in their own closed, hallucinating universe, critics start talking about “Habsburg AI” or “AI madness.” When we can easily dismiss content as unrealistic and false, the danger of misinformation is limited.

But when we cannot determine if content is synthetic or authentic, we may end up living in a world where we cannot trust the information the Internet gives us.

PhD about synthetic data

In his PhD, Anton Danholt Lautrup has mainly looked at the positive aspects of synthetic data, especially for research, but he also concludes that artificial data can pose a threat to society.

In his thesis he writes: “The significant implications of greater data collaboration, dataset augmentation, and amplification should not be viewed in isolation from the potential risks of misuse (…)”. Those risks include increased algorithmic bias, data contamination, and environmental impact, i.e., that it costs quite a lot of energy to create and store synthetic data.

- As generative artificial intelligence continues to surpass our expectations, blurring the lines between what is authentic and artificial, the question is no longer what it can do, but what we choose to do with it, he says.

The title of the PhD thesis is "Generation and Evaluation of Realistic Tabular Synthetic Data", and here is more information about it.

Meet the researcher

Anton Danholt Lautrup is the author of the PhD thesis "Generation and Evaluation of Realistic Tabular Synthetic Data". He now works as a postdoc at the Department of Mathematics and Computer Science.

