Is ChatGPT's training data compatible with the right to data erasure under Article 17 GDPR? Jorie Corten, paralegal at Watsonlaw and master's student in International Technology Law, dives into this issue. She concludes that the right to data erasure probably also applies to ChatGPT's training data. In practice, though, this is complicated: "While it is technically possible to erase a data subject's personal data from ChatGPT, this has little impact on the patterns the model has already learned."
Artificial intelligence has revolutionized the way we interact with technology, and one of the most recent and impressive examples of this is ChatGPT.[1] This model can generate human-like text, making it useful for a wide range of applications, from customer service chatbots to content creation.
However, due to its wide applicability and rapid emergence, ChatGPT has created many legal uncertainties. This article focuses on the General Data Protection Regulation (GDPR).[2]
ChatGPT is a Large Generative Artificial Intelligence Model (LGAIM), a technology developed using a large amount of data.[3] ChatGPT was trained in two phases: (i) training on a large text dataset ("training data"), and (ii) refinement through human input ("human input data").[4]
Training data
ChatGPT was trained on a dataset of more than 45 terabytes of text from the Internet, including books, papers, web pages and other text-based content.[5] Its output may be biased or inaccurate, due to the probabilistic nature of its responses and the imperfect quality of the training data (texts from the Internet).[6]
Human input data
After being trained on text from the Internet, ChatGPT was refined with human input. For example, prompts such as "write a research paper on ChatGPT's compatibility with the right to data erasure", or feedback from users clicking a "thumbs up" button, help ChatGPT train itself.[7]
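To make this feedback loop concrete, here is a minimal sketch of how rated exchanges could be collected into a dataset for a later fine-tuning round. It is purely illustrative: the names (FeedbackStore, record) are hypothetical and do not reflect OpenAI's actual systems.

```python
from dataclasses import dataclass, field

@dataclass
class FeedbackStore:
    """Hypothetical store for user-rated exchanges (not OpenAI's API)."""
    examples: list = field(default_factory=list)

    def record(self, prompt: str, response: str, thumbs_up: bool) -> None:
        # Each rated exchange becomes a labelled example that a later
        # fine-tuning round can learn from.
        self.examples.append(
            {"prompt": prompt, "response": response, "label": int(thumbs_up)}
        )

store = FeedbackStore()
store.record(
    prompt="write a research paper on ChatGPT's compatibility "
           "with the right to data erasure",
    response="(model output)",
    thumbs_up=True,
)
# store.examples now forms a small preference dataset for fine-tuning.
```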
The datasets used to train ChatGPT are typically drawn from a wide range of texts available on the Internet, and such texts contain personal data.[8] I asked ChatGPT whether it excluded these personal data from its dataset during training.
ChatGPT replied that "(...) ChatGPT itself did not exclude personal data from its dataset during the training, (...)". This article therefore assumes that ChatGPT's training data includes personal data within the meaning of Article 4(1) GDPR.
The GDPR provides a legal framework for data protection and applies "to the processing of personal data..."[9] Data protection rights, such as the right to data erasure, apply only when personal data are processed.[10]
The right to erasure
The right to data erasure is an essential online right of data subjects.[11] A data subject is a natural person who is identified or identifiable through personal data.[12] According to Article 17(1) GDPR, data subjects have "the right to obtain from the controller erasure of personal data relating to them (...)". This right to data erasure was first recognised by the Court of Justice of the European Union (CJEU) in the landmark Google Spain judgment.[13]
The CJEU ruled that a "fair balance" must be sought between data subjects' rights to data protection and the legitimate interests of search engines.[14] The CJEU continued that these rights of data subjects override, "as a rule, not only the economic interest of the operator of the search engine but also the interest of the general public in having access to that information upon a search relating to the data subject's name".[15]
But when the interference is justified by the general public's interest in having access to the information in question, the data subject's rights may not prevail.[16] The GDPR codified Google Spain and elaborated on the right to data erasure.[17]
ChatGPT is an LGAIM, not an Internet search engine. However, ChatGPT is commonly used as a search engine and is trained on data that Internet search engines make available to the general public. I would therefore argue that the same rules apply.
Personal data can be removed from the trained ChatGPT model in two ways: (i) retraining the model on a modified dataset, or (ii) machine unlearning.
Retraining the dataset
The ChatGPT model can be retrained on a modified training dataset from which the personal data have been removed.[18] A significant drawback is that retraining is computationally very intensive, making it expensive and time-consuming.[19]
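By way of illustration only, the sketch below uses a toy word-frequency "model" as a stand-in for an LGAIM. Honouring an erasure request means filtering the data subject's documents out of the corpus and rebuilding the model from scratch; for a model of ChatGPT's scale, that rebuild is the expensive step.

```python
from collections import Counter

def train_unigram_model(corpus):
    """Toy stand-in for training: count word frequencies across documents."""
    model = Counter()
    for doc in corpus:
        model.update(doc["text"].split())
    return model

def erase_and_retrain(corpus, subject_id):
    """Honour an erasure request by dropping the subject's documents
    and retraining from scratch on whatever remains."""
    filtered = [doc for doc in corpus if doc["subject_id"] != subject_id]
    return train_unigram_model(filtered)  # the expensive step for an LGAIM

corpus = [
    {"subject_id": "A", "text": "alice lives in amsterdam"},
    {"subject_id": "B", "text": "bob writes about amsterdam"},
]
retrained = erase_and_retrain(corpus, "A")
assert "alice" not in retrained  # the erased subject's data is gone
```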
Machine unlearning
Another option is to modify the model itself after it has been trained ("machine unlearning").[20] However, this is very complicated and almost never feasible with existing systems.[21] Techniques for machine unlearning are only now emerging and have not yet been sufficiently researched.[22]
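The sketch below illustrates, again with a toy model, the "summation form" idea behind early unlearning work such as Cao & Yang (2015): if the model state is an aggregate of per-document contributions, one document's contribution can simply be subtracted. A deep neural network like ChatGPT does not decompose this cleanly, which is precisely why unlearning is so hard in practice.

```python
from collections import Counter

class UnlearnableUnigramModel:
    """Toy model in 'summation form': the state is a sum of per-document
    contributions, so a single document can be subtracted back out."""

    def __init__(self):
        self.counts = Counter()

    def learn(self, doc_text: str) -> None:
        self.counts.update(doc_text.split())

    def unlearn(self, doc_text: str) -> None:
        # Subtract exactly what this document contributed, then drop
        # entries whose count has fallen to zero.
        self.counts.subtract(doc_text.split())
        self.counts += Counter()

model = UnlearnableUnigramModel()
model.learn("alice lives in amsterdam")
model.learn("bob writes about amsterdam")
model.unlearn("alice lives in amsterdam")
assert "alice" not in model.counts   # unlearned without full retraining
assert model.counts["amsterdam"] == 1
```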
Erasing the personal data of a single data subject from the training data usually has little impact on the patterns the model has already learned.[23] Erasing larger amounts of personal data can affect those patterns more noticeably, which makes the right to data erasure particularly interesting in the context of a collective action.[24]
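This scale effect can be shown with the same toy setup and hypothetical numbers: erasing one data subject among thousands barely moves the learned statistics, whereas erasing many at once does.

```python
from collections import Counter

def train(corpus):
    model = Counter()
    for text in corpus:
        model.update(text.split())
    return model

# A hypothetical corpus in which 10,000 documents mention "amsterdam".
corpus = ["alice visited amsterdam"] + ["amsterdam canal tour"] * 9_999

full = train(corpus)
one_erased = train(corpus[1:])       # one data subject erased
many_erased = train(corpus[:5_000])  # thousands of documents erased

# One erasure barely shifts the learned pattern; mass erasure does.
print(full["amsterdam"], one_erased["amsterdam"], many_erased["amsterdam"])
# -> 10000 9999 5000
```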
The first dilemma is that, following Google Spain, the general public's interest in having access to the information in question may override the right to data erasure.[25]
The Article 29 Working Party, an independent European advisory body on data protection and privacy, stated that "Internet users have an interest in receiving information through search engines."[26] In this context, the fundamental right to freedom of expression, enshrined in Article 11 of the Charter of Fundamental Rights of the European Union, must be taken into account.[27]
A second dilemma arises because ChatGPT's servers are located in the United States (U.S.), while the model is trained on data from around the world.[28] In its landmark Google v CNIL judgment, the CJEU ruled that search engines are required to erase personal data only within the EU.[29] However, the CJEU added that EU law does not prohibit the erasure of personal data from all servers worldwide.[30]
This raises the key questions of this article: (i) Is the general public's interest in access to ChatGPT's training data outweighed by data subjects' right to data erasure, so that for ChatGPT the balance struck by the CJEU in Google Spain tips the other way?
And, if so, (ii) should these data be erased from ChatGPT's servers in the U.S., from the model accessible in Europe, or only from the model accessible in the applicant's country?
This article has shown that ChatGPT's training data is likely to include personal data within the meaning of Article 4(1) GDPR, and that the right to data erasure therefore applies.
While it is technically possible to erase a data subject's personal data from ChatGPT, this has little impact on the patterns the model has already learned. It is therefore unclear whether ChatGPT's training data is fully compatible with the right to data erasure under Article 17 GDPR.
Further research is needed to clarify this issue. In the meantime, one possible avenue is for data subjects to initiate a collective action.
Sources
1. Hacker, P., Engel, A., & Mauer, M. (2023). Regulating ChatGPT and other large generative AI models. arXiv preprint arXiv:2302.02337, pp. 2-3. (Hereinafter: Hacker et al (2023)); Hacker, P. (2023). Understanding and regulating ChatGPT, and other large generative AI models. Verfassungsblog: On Matters Constitutional, p. 2. (Hereinafter: Hacker (2023)).
2. Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016, on the protection of natural persons with regard to the processing of personal data and on the free movement of such data and repealing Directive 95/46/EC (General Data Protection Regulation).
3. Hacker et al (2023), pp. 2-3; Sarel, R. (2023). Restraining ChatGPT, pp. 8-9. (Hereinafter: Sarel (2023)).
4. Europol. (2023). ChatGPT: The impact of Large Language Models on Law Enforcement. Accessed April 7, 2023, from https://www.europol.europa.eu/cms/sites/default/files/documents/Tech%20Watch%20Flash%20-%20The%20Impact%20of%20Large%20Language%20Models%20on%20Law%20Enforcement.pdf, pp. 3-4. (Hereinafter: Europol (2023)).
5. Europol (2023), pp. 3-4; Sarel (2023), p. 9.
6. Sarel (2023), pp. 8-9.
7. Hacker (2023), p. 2.
8. Protection of personal data and privacy. Council of Europe. Accessed April 8, 2023, from https://www.coe.int/en/web/portal/personal-data-protection-and-privacy.
9. Purtova, N. (2018). The law of everything. Broad concept of personal data and future of EU data protection law. Law, Innovation and Technology, 10(1). Accessed March 29, 2023, from https://doi.org/10.1080/17579961.2018.1452176, pp. 43-44.
10. Ibid.
11. Tzanou, M. (2020). The unexpected consequences of the EU Right to Be Forgotten: Internet search engines as fundamental rights adjudicators. In Personal Data Protection and Legal Developments in the European Union (pp. 279-301). IGI Global, pp. 1-2 of the electronic copy. (Hereinafter: Tzanou (2020)).
12. Article 4(1) GDPR.
13. CJEU [GC] 13 May 2014, Google Spain, C-131/12, ECLI:EU:C:2014:317, para. 99 and the operative part. (Hereinafter: Google Spain).
14. Google Spain, para. 81.
15. Google Spain, para. 99 and the operative part.
16. Ibid.
17. Tzanou (2020), pp. 1-2.
18. Veale, M., Binns, R., & Edwards, L. (2018). Algorithms that remember: model inversion attacks and data protection law. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 376(2133), 20180083, p. 9. (Hereinafter: Veale et al (2018)).
19. Ibid.
20. Veale et al (2018), p. 9.
21. Ibid.
22. Cao, Y., & Yang, J. (2015, May). Towards making systems forget with machine unlearning. In 2015 IEEE symposium on security and privacy (pp. 463-480). IEEE.
23. Veale et al (2018), p. 10.
24. See, e.g., Ausloos, J., Toh, J., & Giannopoulou, A. (Nov. 23, 2022). The case for collective action against the harms of data-driven technologies. Ada Lovelace Institute. Accessed April 7, 2023, from https://www.adalovelaceinstitute.org/blog/collective-action-harms/.
25. Google Spain, para. 99.
26. Article 29 Working Party. (2014). Guidelines on the implementation of the Court of Justice of the European Union judgment on "Google Spain and Inc. v. Agencia Española de Protección de Datos (AEPD) and Mario Costeja González" C-131/12. Accessed April 8, 2023, from https://ec.europa.eu/newsroom/article29/tems/667236/en, p. 6.
27. Ibid.
28. OpenAI. (2023). GPT-4. Accessed April 7, 2023, from https://openai.com/research/gpt-4.
29. CJEU [GC] 24 September 2019, Google v CNIL, C-507/17, ECLI:EU:C:2019:772, para. 63. (Hereinafter: Google v CNIL).
30. Google v CNIL, para. 72.