faul_sname
Agreed.
Specifically in the field of medicine, they can do most of the raw cognitive labor doctors do - as well as or better than the average doctor.
As an outsider, I am unsure of how impressive this is. I know that "most of the raw cognitive labor programmers do while writing code" is fairly rote, but I don't know how true that is for doctors.
How much of the raw cognitive labor doctors do could be done by a bright undergrad with access to UpToDate and a bunch of case histories, both with semantic search?
I've actively experimented with this, and it's one of my vibes benchmarks for any new model release.
Of course, my standards were that it should pass my own sniff test. That roughly amounts to "does that sound like me?" and "are those the kinds of arguments I'd make?". They do a decent job, and have for a while. Not perfect, but good enough to fool most of the people, most of the time.
Huh. It's been my sniff test for new models as well, and so far I have not seen much success. It should be easy! This is literally the most LLM-flavored task to ever task! And yet. I've sunk probably 50 hours into it.
My most recent attempt, which I sunk about 10 hours and $100 into, and which got a lot closer than any previous attempt, involved giving Claude a corpus of all my past writing and having it try multiple different ways of producing text on arbitrary topics in my voice. The things I tried were:
- Just throw a lot of writing samples at it and ask it to write in the same voice (just sounded like standard generic Claude)
- Take 5 of my writing samples, come up with plausible prompts to generate them, throw them into llama-405b base (via Hyperbolic) in the format [<prompt>...</prompt><response>my_sample</response>] x5, followed by <prompt>[the real prompt]</prompt><response> (didn't follow the prompt, broke with my writing style fairly early)
- That, but doing a product-of-experts thing with multiple continuations (same result, if anything a little worse)
- Standard SFT on my voice (gets the texture of my writing right, but can't maintain coherence for more than a sentence or two; if trained for more epochs it just memorizes the things I've written and ignores the prompts)
- Took a bunch of my writing samples, flattened them by "rewriting them to sound better", then did SFT on the task of reversing that, i.e. "here is an AI-generated passage: [slopified original]. Rewrite it in faul_sname's voice: [original passage I wrote]" (kinda sounds like me if I were actively having a stroke)
- Same but DPO instead of SFT (different kind of stroke)
- Clever-sounding GAN setup (couldn't get it working, gave up after a few hours)
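For concreteness, the few-shot base-model setup in the second bullet amounts to assembling a single prompt string and leaving the final response tag open so the base model continues in-voice; a minimal sketch (the tag convention matches the above, but the sample texts here are illustrative, not my actual corpus):

```python
# Sketch of the few-shot prompt assembly for a base model (e.g. llama-405b).
# The writing samples and prompts below are stand-ins for illustration.

def build_fewshot_prompt(examples, real_prompt):
    """examples: list of (plausible_prompt, writing_sample) pairs."""
    parts = []
    for plausible_prompt, sample in examples:
        parts.append(f"<prompt>{plausible_prompt}</prompt>"
                     f"<response>{sample}</response>")
    # End with an open <response> tag so the base model completes it.
    parts.append(f"<prompt>{real_prompt}</prompt><response>")
    return "".join(parts)

demo = build_fewshot_prompt(
    [("Write about coffee", "I drink too much coffee."),
     ("Write about sleep", "I sleep too little.")],
    "Write about LLMs",
)
```

With 5 real (sample, back-inferred prompt) pairs instead of the toy ones, the resulting string is what gets fed to the base model as a raw completion request.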
On the one hand, I was very impressed by how good Claude was at running a whole bunch of these experiments very quickly. On the other hand, it did not work for me, not even at the level of "passes the sniff test", much less at the level of "standard stylometry techniques say it sounds like me".
[A corpus of all my past writing] would be an absolute pain to collate, both for me and for Claude
I think you'll find that this is one of the tasks that is now much much easier. It's actually been within the capabilities of frontier models since Sonnet 4.0 (which is when I went ahead and gathered said corpus, on the theory that it'd be pretty useful to have). The prompt you're looking for is something like "Here's a chrome instance running with --remote-debugging-port and logged in on most of the sites I post on with a tab open for each. Go generate a corpus of all my publicly available writing".
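For reference, launching Chrome with --remote-debugging-port=9222 exposes an HTTP endpoint at http://localhost:9222/json listing the open tabs, which is what the agent walks to find each site. A minimal stdlib sketch of reading that list (the canned response below is illustrative; with a live browser you'd fetch the real one as shown in the comment):

```python
import json
from urllib.request import urlopen  # used against a live browser, see below

DEVTOOLS = "http://localhost:9222"  # matches --remote-debugging-port=9222

def list_tabs(devtools_json):
    """Given the JSON body from DEVTOOLS + "/json", return (title, url)
    pairs for ordinary page targets (skips service workers, extensions)."""
    return [(t["title"], t["url"])
            for t in json.loads(devtools_json)
            if t.get("type") == "page"]

# With a running browser: tabs = list_tabs(urlopen(DEVTOOLS + "/json").read())
# Offline demo with a canned response of the same shape:
canned = json.dumps([
    {"type": "page", "title": "My posts", "url": "https://example.com/posts"},
    {"type": "service_worker", "title": "sw", "url": "https://example.com/sw.js"},
])
tabs = list_tabs(canned)
```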
Would you be willing to pay for that, or provide access?
Yeah. An H100 for 24h would run in the ballpark of $40, well worth it for me to provide. Vast allows transferring credits from one account to another, so I'd happily just transfer $50 of credits over if someone actually wants to do this. Does seem like rather a lot of work though.
We'd need to decide if I'm allowed to use LLMs myself, in the same manner I already do. Claude wouldn't get the benefit of me giving it feedback or editing its output in any capacity
Yeah, that's entirely reasonable. Your voice is very different from Claude's voice.
Do you really need us for the test, beyond whatever blinding or scraping of my writing is necessary?
Yeah, I'm hoping you can prove me wrong here. I've been trying to do this since back in late 2019 when nostalgebraist-autoresponder was shiny and new. I want a good simulacrum of myself! I want to have that simulacrum, and I want to loom it. I want to build an exobrain, and merge with it, and fork off a copy running in the cloud.
BTW I expect there's a substantial market for anyone who manages to build this in a repeatable way. I've looked, and there are as of now no commercial offerings for this (though there are a few commercial offerings that pretend to be this).
I'd struggle to pin you as a moderate either, though I can't recall you expressing strong opinions that aren't more research/alignment oriented. If my hunch is true, then you aren't the target audience for this! Plus I'll eat my hat if you don't already have access to the best models out there.
I only have access to the models you can obtain access to with money - I expect I'm 3-6 months behind the best of what insiders at Anthropic or OAI have access to.
I don't think I'd describe you as an LLM skeptic
An LLM skeptic is an LLM idealist who's been disappointed :)
Also, this is slightly off-topic. We're not evaluating the agents on their ability to write code, we're testing their ability to mimic me convincingly.
I expect looking like you stylometrically while also exhibiting the same patterns of thought you exhibit on a specific topic will involve writing code. But code in the service of trying to mimic you convincingly, rather than in the service of producing some specific durable software artifact.
For the record, I do expect this to be within the capability window within the next 18 months, but I would be pretty surprised if you managed to get Opus 4.6 specifically to do it.
What do you think Claude cannot do?
Write a coherent thousand word post in your voice about a topic of your choosing sufficiently well to fool standard stylometry techniques, and pass the sniff test as sounding like you to others here, even given
- Access to all of your public past writing
- Access to base models
- Access to fine-tuning of base or instruct models
- Access to a vast.ai box with an H100 for 24 hours to do whatever else it wants
What the fuck did I just watch?
$623/mo on food at home plus another $393/mo away from home. Even the <$15,000 income group spends $625/mo across all food, $416 of which is groceries, so I stand by $800/mo not being extravagant if you cook all your meals. Probably a bit of room to budget, but not that much.
If you're eating most of your meals at home, that's about $3 per person per meal. You can eat reasonably well on that, but it doesn't seem exorbitant. I spend about twice that for a family of 3 (because groceries are approximately free compared to rent and taxes, so why not optimize for quality rather than price).
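That per-meal figure is just the monthly grocery spend divided by person-meals; a quick back-of-envelope (the 3-person household, 30-day month, and 3-meals-a-day assumptions are mine, for illustration):

```python
# Back-of-envelope: dollars per person per meal at the $800/mo grocery figure.
# Assumes a 3-person household eating 3 meals a day for a 30-day month.
monthly_groceries = 800          # dollars
people = 3
person_meals = people * 30 * 3   # person-meals per month
cost_per_meal = monthly_groceries / person_meals
# works out to roughly $3 per person per meal
```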

Gemini's sample is impressive! Color me impressed, especially that a straight-up prompt produced that (though I suppose if any technique would get it with current models, it'd be "one shotting through a prompt" rather than "iterative refinement towards a target").
It doesn't sound quite the same as the version of you that lives in my head, but it's awfully close. E.g. I can't imagine you saying
since you don't tend to drop spurious technical details into your walls of text unless they serve a purpose (and also because I half suspect you're not a fan of the amyloid theory of Alzheimer's). More generally, the Gemini piece has a higher density of eyeball kicks than I model your writing as having. And I model your writing as having a lot of those, for a human.
It also seems to drift away from your voice in the second half. And it fails the stylometry vibe check: Pangram detects AI with medium confidence, though maybe in a way that's reparable. Actual stylometry flags it too (Cohen's d of +17 on dashes, +2 on words >9 letters, +1.5 on mean word length in general, -2 on 3-4 letter words, -1.2 on punctuation in general - i.e. you use more and more varied punctuation and shorter words, by a notable margin, and Gemini uses way, way, way more dashes). Still, it's much, much better than I expected! (and yeah, the Claude one is not even worth discussing)
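Those numbers are just Cohen's d computed per feature over per-document rates; a minimal stdlib sketch (the dash-rate feature and toy corpora here are purely illustrative):

```python
import statistics

def cohens_d(xs, ys):
    """Cohen's d between two samples, using the pooled standard deviation."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    vx, vy = statistics.variance(xs), statistics.variance(ys)
    nx, ny = len(xs), len(ys)
    pooled = (((nx - 1) * vx + (ny - 1) * vy) / (nx + ny - 2)) ** 0.5
    return (mx - my) / pooled

def dash_rate(text):
    """Dashes per 1000 characters - one of many possible stylometric features."""
    return 1000 * text.count("-") / max(len(text), 1)

# Toy corpora: per-document dash rates for "model" vs "human" documents.
model_docs = ["a - b - c - d", "x - y - z", "p - q - r - s"]
human_docs = ["plain sentence one.", "another plain one.", "and a third, here."]
d = cohens_d([dash_rate(t) for t in model_docs],
             [dash_rate(t) for t in human_docs])
# d comes out large and positive: the "model" docs use far more dashes
```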
Interestingly, your results look much, much better to me than the ones I get myself. I ran the same test as you did against Gemini, and got these not-very-good attempts: 1 2 3. Gemini took distinctive phrases (e.g. "85% agree") and ideas (e.g. "claude code as supply chain risk") I have used once in the corpus, fixated on them, and stitched them together into a skinsuit which superficially resembles my writing but doesn't hold up under scrutiny. Interestingly, that's a very base model flavored failure mode. I have grown unused to seeing base-model-flavored failure modes, and as such Gemini is much more interesting to me now.
ETA: also one entertaining failure I got when trying to do this in multi-turn: Gemini didn't realize it had ended its thinking block, and dumped its raw chain of thought, ending with "Go. Bye. Out. Okay. End. Wait. Okay. Done. Executing." over and over hundreds of times. chat log