Training AI models – European Data Protection Board’s opinion and recent developments

This article was co-authored by Ed Le Gassick, trainee solicitor.

The rapid integration of artificial intelligence (AI) across various sectors raises a pivotal question: How can organisations ensure that their AI models comply with the stringent requirements of the General Data Protection Regulation (GDPR)?

On 17 December 2024, the European Data Protection Board (EDPB) issued Opinion 28/2024 on personal data processing in the context of AI models, providing crucial guidance in response to four main queries raised by the Irish Data Protection Commission (DPC) in September 2024.  

This article delves into the key takeaways from the EDPB's Opinion, identifies areas where further clarification is needed, and provides insights into recent regulatory enforcement, together with practical recommendations for organisations striving to align their AI governance frameworks and practices with the GDPR.

Key takeaways in the EDPB's opinion

The Opinion clarifies how controllers can comply with GDPR requirements when personal data is processed in AI models, during both the development and deployment stages. In particular, it focuses on:

  • Anonymisation of AI models: the Opinion emphasises that determining whether an AI model can be considered truly anonymous requires a case-by-case assessment. An AI model trained on personal data cannot automatically be considered anonymous, as personal data can potentially be extracted from it. It must be highly improbable for any individual to be identified, directly or indirectly, from the model or its outputs.
  • Legitimate interests as a legal basis: the Opinion confirms that reliance on legitimate interests as a legal basis is possible in relation to AI model development and deployment. Controllers must conduct a legitimate interest assessment (LIA), which includes identifying a legitimate interest, demonstrating the necessity of the processing, and performing a balancing test to ensure that the controller’s interests are not overridden by the interests, rights and freedoms of individuals.
  • Impact of unlawful processing: the Opinion considers scenarios where personal data is unlawfully processed during the development phase of an AI model. It highlights that such unlawful processing may negatively impact the lawfulness of subsequent processing or operation of the AI model, depending on the circumstances.

Anonymisation of AI models

The Opinion clarifies that whether an AI model can be considered truly anonymous requires a case-by-case assessment. Anonymity would mean that an individual is no longer identifiable and the data is no longer personal data. In such a case, the GDPR would not apply to the processing at hand.

How to assess anonymity in an AI model?

The EDPB provides a test to assess whether an AI model can be considered anonymous. This assessment hinges on whether, using “all the means reasonably likely to be used”, there is a negligible likelihood of:

  • direct extraction of personal data regarding individuals whose data was used to train the model; or
  • unintentional or intentional disclosure of personal data to general users, such as personal data revealed in response to queries made to chatbots (e.g. ChatGPT).

To meet the standard of anonymity, the risk of identifying or exposing personal data for any data subject must be insignificant.
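By way of illustration only (the Opinion does not prescribe any particular tooling), a controller might support this assessment with empirical probes for memorised training data. The minimal Python sketch below assumes a hypothetical `generate` wrapper around the model under review and a sample of the records used in training; neither is part of the EDPB's test.

```python
# Illustrative sketch only: an empirical probe for verbatim regurgitation of
# training records, as one input into a case-by-case anonymity assessment.
# The `generate` wrapper and the sample records are hypothetical placeholders.
from typing import Callable, Iterable


def regurgitation_rate(
    generate: Callable[[str], str],   # hypothetical wrapper around the model under review
    training_records: Iterable[str],  # sample of records containing personal data used in training
    prefix_chars: int = 40,           # portion of each record used as the prompt
) -> float:
    """Share of sampled records whose withheld continuation the model reproduces verbatim."""
    records = [r for r in training_records if len(r) > prefix_chars]
    if not records:
        return 0.0
    hits = 0
    for record in records:
        prefix, remainder = record[:prefix_chars], record[prefix_chars:]
        completion = generate(prefix)
        # Count a hit if the model reproduces the withheld remainder of the record.
        if remainder.strip() and remainder.strip() in completion:
            hits += 1
    return hits / len(records)
```

A materially non-zero rate on probes of this kind, or on more sophisticated membership inference and extraction attacks, would suggest that personal data can be extracted using means reasonably likely to be used, and would therefore weigh against treating the model as anonymous.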

Whilst the test outlined above is helpful, the overarching conclusion that AI models are not inherently anonymous presents challenges for developers and controllers.

This perspective implies that any data processed by such AI models falls under the full force of the GDPR. Consequently, a data subject may theoretically be able to exercise their data subject rights. Controllers would then need to assess the proportionality and feasibility of accommodating those rights.

The EDPB's Opinion acknowledges that AI models can infer personal data from probabilistic relationships.  However, it does not clearly define the boundary between the AI model’s functionality and the data controller's responsibilities, leaving ambiguity around the extent to which accountability for data processing within the model applies to the controller.

Notably, this stance contrasts with the view of the Hamburg Supervisory Authority (SA) outlined in its discussion paper on Large Language Models (LLMs) and personal data, which asserts that the mere storage of an LLM does not constitute data processing, as “no personal data is stored in LLMs”.

Assessment tools to evaluate the residual risk of identification

Any assessment of the anonymity of an AI model should also take into account the risks arising from direct access to the model itself, not only to its outputs. The EDPB provides a non-exhaustive list of criteria for SAs to consider when assessing anonymity, including:

  • Data source evaluation;
  • Data minimisation techniques;
  • Training methodology;
  • Outputs;
  • Robustness verification; and
  • Documentation review.

In this regard, the Opinion stresses the importance of maintaining comprehensive documentation in relation to personal data processing within AI models, such as:

  • DPIAs;
  • advice or feedback from the Data Protection Officer;
  • technical and organisational measures taken both during the design of the model and at all stages of its lifecycle;
  • documentation demonstrating the AI model’s theoretical resistance to re-identification techniques; and
  • documentation provided to the controller deploying the model and/or to data subjects.

Legitimate interests

The EDPB confirms that organisations can rely on legitimate interests as a legal basis for processing personal data during the development and deployment of AI models, subject to conducting an LIA. The LIA involves a three-step test:

  1. Identify and document the legitimate interest: clearly define the lawful, specific, and present (i.e., not speculative) interest pursued by the processing activity. Examples of valid legitimate interests in the context of AI include developing conversational agents to assist users and implementing AI systems for fraud detection.
  2. Assess the necessity of the processing: evaluate whether the processing is essential to achieve the stated legitimate interest and whether any less intrusive means are available. The EDPB emphasises that if the legitimate interest can be pursued using an AI model that does not process personal data, then processing personal data would not be considered necessary. Controllers must justify the need to use personal data over alternatives such as synthetic data in each case.
  3. Perform a ‘balancing test’: weigh the controller's legitimate interests against the fundamental rights and freedoms of the data subjects. This assessment should consider not only privacy rights but also potential impacts such as discrimination and other harms associated with AI technologies.

The EDPB advises controllers to consider whether individuals would reasonably expect their personal data to be used in the AI model.
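Purely as an illustration (the Opinion does not prescribe any particular format), the three-step test lends itself to being kept as a structured, auditable record. The field names and example values in the Python sketch below are assumptions for illustration, not EDPB terminology.

```python
# Illustrative sketch (Python 3.9+): a structured record of the three-step LIA.
# Field names and example values are assumptions, not EDPB terminology.
from dataclasses import dataclass, field


@dataclass
class LegitimateInterestAssessment:
    # Step 1: the lawful, specific and present interest pursued
    legitimate_interest: str
    # Step 2: necessity of the processing and less intrusive alternatives considered
    necessity_rationale: str
    alternatives_considered: list[str] = field(default_factory=list)
    # Step 3: balancing test against data subjects' rights and reasonable expectations
    impacts_on_data_subjects: list[str] = field(default_factory=list)
    mitigating_measures: list[str] = field(default_factory=list)
    reasonably_expected_by_data_subjects: bool = False
    outcome: str = "undetermined"  # e.g. "proceed", "proceed with safeguards", "do not proceed"


# Hypothetical example: an AI system used for fraud detection
lia = LegitimateInterestAssessment(
    legitimate_interest="Detecting fraudulent transactions on the platform",
    necessity_rationale="Synthetic or aggregated data alone does not capture real fraud patterns",
    alternatives_considered=["synthetic data", "aggregated statistics"],
    impacts_on_data_subjects=["profiling of transaction history"],
    mitigating_measures=["pseudonymisation", "limited retention period"],
    reasonably_expected_by_data_subjects=True,
    outcome="proceed with safeguards",
)
```

Recording the assessment in a structured form of this kind also supports the documentation expectations discussed above.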

Regulatory risks, enforcement actions and practical recommendations

It is no coincidence that the publication of the EDPB Opinion was closely followed by the announcement of the first fine by a European SA against OpenAI. The decision of the Italian Data Protection Authority (Garante) closely reflects the principles and conclusions reached in the EDPB Opinion in relation to accountability, lawfulness of processing and transparency.

The Garante decision and other developments

The Garante has recently demonstrated its growing regulatory focus on AI operations and data protection compliance. Following OpenAI’s decision to establish its European headquarters in Ireland, the DPC has become the lead SA for addressing potential data protection breaches by OpenAI.

On 20 December 2024, the Garante issued OpenAI with a fine of €15 million, alongside corrective measures, for multiple violations related to ChatGPT. This follows an earlier investigation in March 2023 that led to a temporary ban of the AI tool in Italy. The violations cited include:

  • Unlawful Data Processing: OpenAI failed to establish an appropriate lawful basis for processing personal data used for training ChatGPT and for its subsequent public deployment.
  • Transparency Failures: OpenAI did not fulfil its obligations to provide clear and comprehensive information to users about how their data was being processed.
  • Inadequate Age Verification: Insufficient measures were in place to protect minors from accessing and using the platform.
  • Inadequate Risk Assessments: The Garante found that OpenAI had conducted neither an adequate DPIA nor an LIA as required under the GDPR.

Had OpenAI been able to leverage the guidance provided by the EDPB Opinion - particularly its presumption that AI models are not inherently anonymous and its framework for identifying lawful bases and assessing legitimate interest - it might have mitigated some of these issues. However, this guidance came after the fact, leaving ChatGPT's compliance shortcomings as a representative case study for regulators.

Impact of unlawful processing on subsequent use

Unlawful processing during the development phase of an AI model introduces regulatory risks that can have far-reaching consequences for businesses. As highlighted by the EDPB, this can compromise the lawfulness of subsequent processing or operations, exposing organisations to enforcement actions and undermining their compliance posture.

The EDPB’s Opinion explores three scenarios to demonstrate how regulatory authorities might assess the downstream impact of unlawful processing:

  • Scenario 1 (Same Controller): A controller unlawfully processes personal data to develop the model, the personal data is retained in the model and is subsequently processed by the same controller (for instance in the context of the deployment of the model). Subject to a case-by-case assessment, the unlawful processing in the development phase may need to be taken into account in the LIA for the subsequent processing and may weigh against the controller, particularly where data subjects would not reasonably expect that subsequent processing.
  • Scenario 2 (Different Controller): A controller unlawfully processes personal data to develop the model, the personal data is retained in the model and is processed by another controller in the context of the deployment of the model. Each controller is responsible for ensuring the lawfulness of its own processing, must conduct an appropriate assessment, and must be in a position to demonstrate that the AI model was not developed by unlawfully processing personal data.
  • Scenario 3 (Anonymised Data): A controller unlawfully processes personal data to develop a model, then anonymises the model before the same or another controller initiates further processing during the deployment of the AI model. Subject to the SA's ability to impose corrective measures in respect of the initial processing, this is the ‘get out of jail’ scenario: the GDPR is unlikely to apply to the further processing of anonymised data, even though the initial processing of personal data was unlawful.

These scenarios illustrate the significant regulatory risks that arise from non-compliance during the early stages of AI development. Organisations must recognise that unlawful practices during development can have a cascading effect, amplifying liability risks and damaging trust.

Practical recommendations for organisations developing or deploying AI models

The key takeaways for organisations based on the Opinion are as follows:

  • Assess anonymisation: apply the test set out by the EDPB to assess whether an AI model can genuinely be considered anonymous.
  • Conduct DPIAs and LIAs: evaluate the necessity and proportionality of data processing at every stage of the AI model’s lifecycle.
  • Strengthen transparency and governance:
    • Provide users with clear information about how their data is used in AI models, including details on automated decision-making processes.
    • Implement robust child protection measures, such as age verification mechanisms, to safeguard minors.
  • Establish internal controls: maintain detailed records of processing activities, governance structures, and compliance mechanisms.

With these steps, organisations can navigate the complexities of GDPR compliance in AI operations, mitigate the risk of enforcement actions, and build trust with users and stakeholders.