33 of 40 Claude "secret codes" are placebo

Someone finally blind-tested the Claude prompt hacks


If you've been collecting magic words from Twitter threads and Reddit posts, sit down. Most of them do nothing.

Someone finally tested this properly. u/AIMadesy on r/PromptEngineering ran 40 of the most-circulated codes through three months of blind A/B testing. Fresh context every run. Fixed task batteries across coding, analysis, and writing. Blind ordering between generation and scoring, so the rater never knew which response came from the "powered-up" prompt. 12 to 20 runs per code.

The blind ordering part matters more than it sounds. Raters who know which prompt is the "unlocked" version consistently score it higher, no matter what the output actually looks like. Placebo leaks into everything if you don't design it out.

Task batteries included finding logical errors in business proposals, debugging broken Python functions, and flagging unstated assumptions in strategy questions. Real reasoning under real pressure.
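Here's a minimal sketch of what that setup looks like in practice, assuming the official anthropic Python SDK. The task text, the code under test, and the model name are placeholders, and the human rating step is left out. The part that matters is the shuffle: the rater only ever sees "A" and "B".

```python
# Minimal blind A/B sketch. Assumes the `anthropic` Python SDK and an
# ANTHROPIC_API_KEY in the environment. All names and values are placeholders.
import random
import anthropic

client = anthropic.Anthropic()
MODEL = "claude-sonnet-4-5"  # assumption: swap in whichever model you are testing


def run(prompt: str) -> str:
    """One completion with a fresh context, so runs can't contaminate each other."""
    resp = client.messages.create(
        model=MODEL,
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.content[0].text


def blind_pair(task: str, code: str) -> dict:
    """Generate baseline and code-augmented responses, then hide which is which."""
    responses = [("baseline", run(task)), ("code", run(f"{task} {code}"))]
    random.shuffle(responses)  # blind ordering: the rater sees only 'A' and 'B'
    return {
        "shown_to_rater": {"A": responses[0][1], "B": responses[1][1]},
        "answer_key": {"A": responses[0][0], "B": responses[1][0]},  # open after scoring
    }


# Example: one run of one task-battery item against /skeptic
pair = blind_pair("Should we migrate our monolith to microservices this quarter?", "/skeptic")
```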

The verdict: 7 codes produced measurable reasoning change. 33 were tone shifts dressed up as thinking unlocks.

How Jennifer Aniston’s LolaVie brand grew sales 40% with CTV ads

For its first CTV campaign, Jennifer Aniston’s DTC haircare brand LolaVie had a few non-negotiables. The campaign had to be simple. It had to demonstrate measurable impact. And it had to be full-funnel.

LolaVie used Roku Ads Manager to test and optimize creatives — reaching millions of potential customers at all stages of their purchase journeys. Roku Ads Manager helped the brand convey its playful voice while driving omnichannel sales across both ecommerce and retail touchpoints.

The campaign included an Action Ad overlay that let viewers shop directly from their TVs by clicking OK on their Roku remote. This guided them to the website to buy LolaVie products.

Discover how Roku Ads Manager helped LolaVie drive big sales and customer growth with self-serve TV ads.

The DTC beauty category is crowded. To break through, Jennifer Aniston’s brand LolaVie worked with Roku Ads Manager to easily set up, test, and optimize CTV ad creatives. The campaign drove a big lift in sales and customer growth.

The 7 that actually work

/skeptic. Biggest win in the dataset by a mile. Caught wrong premises in 79% of "should I do X" tests, vs 14% baseline. What it actually does is reframe the task, from answering the question to first interrogating whether the question is well-formed. One word, 5x boost.

L99. Committed to a single answer 11 out of 12 times, vs 2 out of 12 baseline. Use it when you need a decision, not a hedged "here are seven considerations" essay. Especially useful for go/no-go calls where a list of trade-offs is worse than useless.

ULTRATHINK. Debugging correctness at 87.5% vs 62.5% baseline. The catch: 3.2x token cost. Not something you paste into every prompt. Reserve it for problems where a wrong answer has real consequences.

/blindspots, /crit, /deep, /premortem. Smaller but real effects on reasoning depth and error-catching. /premortem stands out. It forces the model to assume the plan already failed and work backward, which consistently surfaces failure modes that normal analysis misses.

That's the real list. Seven codes. Everything else on the Twitter circuit is noise.

The placebo hall of fame

These sound powerful. They measured like nothing.

GODMODE, BEASTMODE, OVERRIDE. Claude sounds more assertive. The reasoning underneath is identical.

"You are an expert in X" / "Act as senior engineer." Tone shift, not judgment shift. The model was already drawing on the same knowledge. You changed the confidence of delivery, not the quality of the analysis.

"Take a deep breath, think step by step." Used to work on older models. Claude 4.x already does stepwise reasoning by default. Now it just adds tokens.

Most jailbreak variants. 4.x alignment is solid enough that these mostly just pile on length. Several of them actually degraded output by pushing the model into verbose self-justification.

The pattern is worth saying out loud. A prompt that makes Claude sound more confident is not the same as a prompt that makes Claude think better. One is cosmetic. The other is useful. Most "secret prompts" live in the first bucket.

3 ways to apply this today

Decision audits. Add /skeptic to any "should we do X" question. Premise-catching jumps from 14% to 79% with one word. No other changes needed. This is the highest-leverage edit you can make to an existing prompt library without rewriting a thing.

High-stakes debugging. Use ULTRATHINK when correctness outranks cost. Critical bugs, security reviews, architecture calls. Skip it for everyday stuff. 3.2x tokens is a real budget line, but when a wrong answer means a production incident, the math flips fast.

Team prompt libraries. If you're standardizing prompts across a team, build around the 7. Strip out the magic-word stuff. It cuts confusion, reduces token spend, and shuts down the "but I heard GODMODE works" argument in Slack.
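If your team's prompts live in code, a minimal sketch of that standardization might look like the following. The task-type names and the helper are hypothetical; the codes and the rough token-cost note come from the list above.

```python
# A small approved-codes map built around the 7, with placebo codes stripped out.
# Task-type names and build_prompt are placeholders to adapt to your own library.
APPROVED_CODES = {
    "decision_audit": "/skeptic",        # premise-checking on "should we do X" questions
    "forced_choice": "L99",              # commit to one answer instead of a trade-off list
    "critical_debugging": "ULTRATHINK",  # reserve for high-stakes work (~3.2x token cost)
    "plan_review": "/premortem",         # assume the plan failed, work backward
}


def build_prompt(task_type: str, body: str) -> str:
    """Append the approved code for this task type; anything else is tone, not reasoning."""
    code = APPROVED_CODES.get(task_type)
    return f"{body} {code}" if code else body


print(build_prompt("decision_audit", "Should we rewrite the billing service in Rust?"))
# -> "Should we rewrite the billing service in Rust? /skeptic"
```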

Stop re-prompting. Say it right the first time.

Voice-first prompts preserve the nuance you cut when typing. Speak once, paste into any AI tool, get results that don't need a follow-up. 89% of messages sent with zero edits.

*Ad

Honest limits of the study

The author was upfront about the ceiling. Single rater. 12 to 20 runs per code. Numbers were gathered on Opus 4.6, Sonnet 4.5, and Haiku 4.5 as of March 2026.

Models drift. A code that does nothing today might start mattering after a fine-tune. One that works now could get baked into default behavior and stop adding signal. Treat this as a current snapshot, not a permanent rulebook.

The bigger trap is not the methodology. It's confusing tone for reasoning when you evaluate any new "magic word" yourself. The real test: does this change what the model checks for, or does it just change how it presents what it already found?

And while we're being honest: context completeness still beats clever phrasing most of the time. If you're stacking codes to compensate for a vague prompt, you're solving the wrong problem. Give the model the actual inputs, constraints, and examples first. Then layer a code on top if the specific task needs it.

Prompt of the day

Try this on your next real decision:

Your question here. /skeptic

One word appended. You go from 14% premise-catching to 79%. If Claude is about to agree with a flawed assumption, /skeptic is what catches it.

Run the same question with and without /skeptic on something you're currently deciding. The difference usually shows up in the first paragraph of the response.
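A quick way to run that comparison, again assuming the anthropic Python SDK. The question and model name are placeholders; only the first paragraph of each response is printed, since that's where the difference usually shows up.

```python
# Side-by-side check: same question, with and without /skeptic.
import anthropic

client = anthropic.Anthropic()
question = "Should we self-host our vector database to cut costs?"  # your real decision here

for label, prompt in [("baseline", question), ("/skeptic", f"{question} /skeptic")]:
    resp = client.messages.create(
        model="claude-sonnet-4-5",  # assumption: swap in the model you actually use
        max_tokens=800,
        messages=[{"role": "user", "content": prompt}],
    )
    first_paragraph = resp.content[0].text.split("\n\n")[0]
    print(f"--- {label} ---\n{first_paragraph}\n")
```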

Action step this week

Pull up the prompt templates you use most. Flag every "magic word" or code in them. Check each one against the 7. If it's not on the list, it's a style choice, not a reasoning boost. Rip it out or keep it as tone control, but stop treating it like a thinking unlock.
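If those templates live in code or config, a rough sketch of the audit could look like this. The template dict, the placebo list, and the matching logic are all placeholders; extend them with whatever magic words your team has actually picked up.

```python
# Flag anything in a prompt template that isn't one of the 7.
THE_SEVEN = {"/skeptic", "L99", "ULTRATHINK", "/blindspots", "/crit", "/deep", "/premortem"}
KNOWN_PLACEBOS = {"GODMODE", "BEASTMODE", "OVERRIDE", "take a deep breath", "think step by step"}

templates = {
    "bug_triage": "GODMODE. Act as senior engineer. Debug this stack trace: {trace}",
    "strategy_review": "Evaluate this plan. /premortem",
}

for name, text in templates.items():
    placebos = [w for w in KNOWN_PLACEBOS if w.lower() in text.lower()]
    working = [w for w in THE_SEVEN if w in text]
    if placebos:
        print(f"{name}: strip or demote to tone control -> {placebos}")
    if working:
        print(f"{name}: keeps a reasoning code -> {working}")
```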

Full methodology and per-code numbers live in the original Reddit thread. The author is also offering to send task batteries to anyone who wants to run an independent replication. Worth doing if your team is serious about this.

AI ads that look and feel like your brand

Most AI tools fall short because they lack context. They generate in a vacuum.

Hightouch Ad Studio uses your data and brand guidelines to produce high-quality creative. Refresh ads based on performance, react to trends, and respond to competitors instantly.

Less time prompting. More time launching.

*Ad