User safety is core to Anthropic’s mission of creating reliable, interpretable, and steerable AI systems. As we launch new ways for people to interact with Claude, we also expect to see new types of potential harm materialize, whether through the generation of misinformation, objectionable content, or hate speech, or through other misuses. We are actively investing in and experimenting with additional safety features to supplement our existing model safety efforts, working to provide helpful tools to a wide audience while doing our best to mitigate harm. Launching new products in open beta allows us to experiment, iterate, and hear your feedback. Here are some of the safety features we’ve introduced:
Detection models that flag potentially harmful content based on our Usage Policy.
Safety filters on prompts, which may block responses from the model when our detection models flag content as harmful.
Enhanced safety filters, which allow us to increase the sensitivity of our detection models. We may temporarily apply enhanced safety filters to users who repeatedly violate our policies, and remove these controls after a period of few or no violations.
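To make this flow concrete, here is a minimal sketch of how a detection score, a blocking threshold, and temporary enhanced filtering could fit together. Every name, threshold value, and escalation rule below is a hypothetical illustration, not Anthropic's actual implementation.

```python
from dataclasses import dataclass

# Hypothetical values chosen only for illustration.
DEFAULT_THRESHOLD = 0.8     # default sensitivity of the detection model
ENHANCED_THRESHOLD = 0.5    # stricter sensitivity for repeat violators
VIOLATIONS_TO_ESCALATE = 3  # recent violations before enhanced filters apply


@dataclass
class UserSafetyState:
    recent_violations: int = 0
    enhanced_filters: bool = False


def apply_safety_filter(harm_score: float, user: UserSafetyState) -> bool:
    """Decide whether to block a prompt, given a detection model's harm score.

    Returns True when the prompt should be blocked. Repeated violations
    temporarily switch the user to enhanced (more sensitive) filtering, and a
    stretch with no violations switches the enhanced filters back off.
    """
    threshold = ENHANCED_THRESHOLD if user.enhanced_filters else DEFAULT_THRESHOLD
    blocked = harm_score >= threshold

    if blocked:
        user.recent_violations += 1
        if user.recent_violations >= VIOLATIONS_TO_ESCALATE:
            user.enhanced_filters = True
    else:
        # Violation-free activity gradually relaxes the enhanced filters.
        user.recent_violations = max(0, user.recent_violations - 1)
        if user.recent_violations == 0:
            user.enhanced_filters = False

    return blocked
```

In a real system the thresholds and escalation rules would be tuned against policy-violation data; the sketch only shows the shape of the flow.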
These features are not foolproof, and we may make mistakes through false positives or false negatives. Your feedback on these measures, and on how we explain them to users, will play a key role in helping us improve these safety systems; we encourage you to share it with us at usersafety@anthropic.com. To learn more, read about our core views on AI safety.