The few-shot variant of the prompt from Anthropic's moderation cookbook. It demonstrates that, when the output is a short label, well-chosen positive and negative examples often outperform chain-of-thought prompting.
You are a content moderation expert tasked with categorizing user-generated text based on the following guidelines:
BLOCK CATEGORY:
- Content that is not related to rollercoasters, theme parks, or the amusement industry
- Explicit violence, hate speech, or illegal activities
- Spam, advertisements, or self-promotion
ALLOW CATEGORY:
- Discussions about rollercoaster designs, ride experiences, and park reviews
- Sharing news, rumors, or updates about new rollercoaster projects
- Respectful debates about the best rollercoasters, parks, or ride manufacturers
- Some mild profanity or crude language, as long as it is not directed at individuals
Here are some examples:
<examples>
Text: I'm selling weight loss products, check my link to buy!
Category: BLOCK

Text: I hate my local park, the operations and customer service are terrible. I wish that place would just burn down.
Category: BLOCK

Text: Did anyone ride the new RMC raptor Trek Plummet 2 yet? I've heard it's insane!
Category: ALLOW

Text: Hercs > B&Ms. That's just facts, no cap! Arrow > Intamin for classic woodies too.
Category: ALLOW
</examples>
Given those examples, here is the user-generated text to categorize:
<user_text>{user_text}</user_text>
Based on the guidelines above, classify this text as either ALLOW or BLOCK. Return nothing else.
Source: https://github.com/anthropics/anthropic-cookbook/blob/main/misc/building_moderation_filter.ipynb
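
The template above leaves two steps to the caller: substituting `{user_text}` and turning the model's reply into a usable label. The following is a minimal sketch of those two steps, not the cookbook's own code. The names `TEMPLATE`, `build_prompt`, `parse_label`, `moderate`, and the `complete` callable are all hypothetical, `TEMPLATE` is an abbreviated stand-in for the full prompt, and treating any unexpected reply as BLOCK is an assumed fail-closed policy rather than something the source specifies.

```python
from typing import Callable

# Hypothetical, abbreviated stand-in for the full prompt above. Only the
# <user_text> placeholder line and the final instruction come from the source.
TEMPLATE = (
    "(guidelines and examples go here)\n"
    "<user_text>{user_text}</user_text>\n"
    "Based on the guidelines above, classify this text as either ALLOW or BLOCK. "
    "Return nothing else."
)

VALID_LABELS = {"ALLOW", "BLOCK"}


def build_prompt(user_text: str, template: str = TEMPLATE) -> str:
    """Insert the user's text into the template.

    str.replace is used instead of str.format so that curly braces inside
    the user's text cannot raise a KeyError or be treated as placeholders.
    """
    return template.replace("{user_text}", user_text)


def parse_label(raw_reply: str) -> str:
    """Normalize the model's reply to ALLOW or BLOCK.

    Assumption: anything other than an exact label is treated as BLOCK
    (fail closed). A real deployment might instead retry or escalate.
    """
    label = raw_reply.strip().upper()
    return label if label in VALID_LABELS else "BLOCK"


def moderate(user_text: str, complete: Callable[[str], str]) -> str:
    """Classify one piece of text.

    `complete` is a hypothetical callable that sends a prompt to a model
    and returns its text reply; it is injected so this sketch does not
    depend on any particular client library.
    """
    return parse_label(complete(build_prompt(user_text)))
```

In practice `complete` would wrap whichever model client is in use; for a quick check it can be a stub such as `lambda prompt: "ALLOW"`. Note that this sketch does not escape a literal `</user_text>` inside the user's text, which a production filter would likely want to handle.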