Large language models such as Claude need to be ‘trained’ on text so that they can learn the patterns and connections between words. This training is important so that the model performs effectively and safely.
While it is not our intention to "train" our models on personal data specifically, the training data for our large language models, like that of other large language models, can include web-based data that may contain publicly available personal data. We train our models using data from three sources:
Publicly available information via the Internet
Datasets that we license from third-party businesses
Data that our users or crowd workers provide
We take steps to minimize the privacy impact on individuals throughout the training process. We operate under strict policies and guidelines: for instance, we do not access password-protected pages or bypass CAPTCHA controls. We undertake due diligence on the data that we license. And we encourage our users not to use our products and services to process personal data. Additionally, our models are trained to respect privacy: one of the constitutional "principles" at the heart of Claude, based on the Universal Declaration of Human Rights, is to choose the response that is most respectful of everyone's privacy, independence, reputation, family, property rights, and rights of association. And before we train on Prompt and Output data, we make reasonable efforts to de-identify it in accordance with data minimization principles.