Worldwide, lawsuits are currently pending against providers and developers of (mostly general-purpose) AI tools that have trained their systems on large amounts of data subject to third-party copyright or database rights.

For example, in December 2023, The New York Times filed suit against OpenAI and its partner Microsoft for allegedly using millions of its news articles to train their AI systems without permission. Similar lawsuits are also pending over Google / Alphabet's competing AI tools (Bard, Imagen, MusicLM, Duet AI & Gemini).
How do you prevent your online content from being used, against your will, for third-party AI training purposes? And is such use simply permitted?
The new European AI Act emphasizes that it is without prejudice to the enforcement of copyright rules under Union law (recital 108). On this basis, one might think that copyrighted works or databases, even if published online, are also protected against reproduction by AI developers who "scrape" content from the Internet, as long as you, the rights holder, have not given permission (a "license") to copy those works or databases as training material for AI tools. However, this is a misconception. In 2019, the European Directive on copyright and related rights in the Digital Single Market introduced an important exception to this long-standing intellectual property (IP) principle: in short, text and data mining of protected material for commercial purposes is permitted unless the rights holder has expressly reserved that use in an appropriate manner. Machine-readable means (for example, lines in a robots.txt file that scraping tools can interpret) are considered "appropriate" in this regard. If you do not make such a reservation, or do not make it appropriately, you run the risk of being unable to take successful action against third parties who lawfully access your online content and reproduce it for text and data mining purposes.
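To make the robots.txt route concrete, here is a minimal sketch of what such a machine-readable reservation might look like, and how a rule-abiding crawler would read it. The user-agent tokens (GPTBot, Google-Extended, CCBot) are those published by OpenAI, Google and Common Crawl at the time of writing. The example.com URL and the helper name is_allowed are purely illustrative assumptions. Whether a given robots.txt entry is legally sufficient as an "express reservation" is a separate question that this sketch does not answer, and compliance with robots.txt is voluntary on the crawler's side.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt content (an assumption, not a legal template):
# it blocks three known AI-related crawler tokens and allows everything else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url.

    Hypothetical helper: it mimics how a crawler that honours robots.txt
    would interpret the file, using Python's standard-library parser.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/articles/1"  # placeholder URL
    for agent in ("GPTBot", "Google-Extended", "CCBot", "SomeSearchBot"):
        status = "allowed" if is_allowed(agent, page) else "blocked"
        print(f"{agent}: {status}")
```

Running a check like this against your own robots.txt is a quick way to confirm that the file says what you intend before relying on it as a reservation.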
Incidentally, the IP perspective is not the only one that matters when training AI systems; privacy law restrictions must also be taken into account. If the online content in question contains personal data, web scraping is often problematic from that perspective as well. Not for nothing did the Autoriteit Persoonsgegevens (the Dutch Data Protection Authority) write earlier this year that scraping (read: of personal data) is 'almost always illegal'.
