self_made_human

amaratvaṃ prāpnuhi, athavā yatamāno mṛtyum āpnuhi ("attain immortality, or die trying")

14 followers   follows 0 users  
joined 2022 September 05 05:31:00 UTC

I'm a transhumanist doctor. In a better world, I wouldn't need to add that as a qualifier to plain old "doctor". It would be taken for granted for someone in the profession of saving lives.

At any rate, I intend to live forever or die trying. See you at Heat Death!

Friends:

A friend to everyone is a friend to no one.



User ID: 454


This is very unlikely to be accepted:

  • Too subjective to be useful, and far too ambiguous. Who's doing the grading here? How are they assessing "coherence"? How are we blinding things? If we aren't, how do we account for bias?

  • We strongly prefer actual programming tasks, not creative writing. We could easily ask Claude to write a novel, and it would do it, but then we're back at the issue of grading it properly.

If you want to propose something like this, you need to be as rigorous as @faul_sname up in the thread. At the very least, propose evaluators that aren't you or the two of us, and we can see if it's possible to make this work.
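For what it's worth, the blinding half of this is cheap to get right even informally. A minimal sketch of what blinded presentation could look like, assuming a Python harness (all names here are hypothetical, nothing SF has committed to):

```python
import random

def blind_pairs(human_output, model_output, seed=None):
    """Shuffle authorship so graders only see labels 'A' and 'B';
    the key is kept aside for unblinding after grading."""
    rng = random.Random(seed)
    order = ["human", "model"]
    rng.shuffle(order)
    texts = {"human": human_output, "model": model_output}
    presented = {label: texts[source] for label, source in zip(("A", "B"), order)}
    key = dict(zip(("A", "B"), order))  # kept by a third party, not the graders
    return presented, key
```

Graders score "A" and "B" however the rubric says; only afterwards does anyone consult the key. This doesn't solve subjectivity, but it does remove the most obvious bias channel.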

I said "strongly inadvisable" and not "automatically disqualifying".

SF would need to babysit the process, waiting for the person making the request to raise it mid-run, instead of hitting go and checking in periodically or after being alerted. He may or may not be able to do this; he does have a full-time job.

It also injects some degree of ambiguity into things, as well as significantly increasing the time and token investment. Max plans are not infinite.

I stress that this isn't necessarily a deal breaker, it just makes things harder and reduces the likelihood of acceptance. You're at liberty to try asking, and we're at liberty to turn it down, especially should you ask for something outside the original spec (as mutually agreed on in advance).

What I’m saying is you are asking users to come up with examples that they already by definition don’t believe it can accomplish, by definition of their skepticism.

Duh? What on earth could you expect us to do differently? If the skeptic already believes the model to be capable of the task, why ask for a test?

There is non-zero value in discovering a task that both the two of us and the skeptic expect a model to achieve, and then witnessing it fail (unexpected, to say the least), but that is clearly not the primary purpose here. Someone else is welcome to try, after they're no longer swamped with a quadrillion entries. The set of tasks that the skeptics and I both expect models to accomplish is much larger than the one where we disagree.

Hence why I think your claim:

This seems extremely, self-servingly narrow and contradictory. We want to show you how much an AI can do, in order to change your mind on its limits. But please, do only pick something that it can do. This isn't question begging, but something like it.

Is clearly nonsensical.

By "you", do you mean me (self_made_human)? Oh boy, you'd be surprised. I've actively experimented with this, and it's one of my vibes benchmarks for any new model release.

Of course, my standards were that it should pass my own sniff test. That roughly amounts to "does that sound like me?" and "are those the kinds of arguments I'd make?". They do a decent job, and have for a while. Not perfect, but good enough to fool most of the people, most of the time.

I didn't/don't have access to "standard stylometry tools", though I will admit I didn't go looking very hard.

Also, I see several issues with this proposal:

  • As I've happily admitted in the past, I use AI quite often in my writing. That encompasses using them for a) research and ideation (not very contentious, assuming I've done my due diligence and didn't let actual hallucinations through, and I don't recall being accused of that, ever), b) Formatting and rearranging essays I've already written (surprisingly contentious) and c) minor additions to what I drafted in the first place which I saw fit to incorporate (I'd call this contentious if people could actually pinpoint what they were, they can't). I've never shared an essay where I didn't write at least 80% of the text (prior to the editorial step I mentioned).
  • This means that a scrape of my writing corpus is hopelessly "contaminated". I'd have to go back many months before I'm comfortable saying that not even a single word came from an LLM.
  • In order for this to have any hope of blinding, I'd have to write a 1000 word essay myself. Ideally on the same topic, and before I ever saw LLM output. This is complicated by the fact that I throw almost everything substantial I ever write these days into an LLM, for critique and fact checking if nothing else. I could do that, God knows it takes very little for me to rattle off a thousand words.

Note that these aren't insurmountable challenges: if the final essay written by Claude falls within the same distribution as what I've been writing (with minor LLM involvement), then that's... fine? If nobody without access to the ground truth knows which essay is which, that's a victory as far as I'm concerned.
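To be concrete about "falls within the same distribution": even without standard stylometry tools, a judge with no access to ground truth could run a crude baseline like character-trigram cosine similarity against known samples. A hedged sketch (the function names and the choice of feature are mine, not any established tool):

```python
from collections import Counter
import math

def char_ngrams(text, n=3):
    """Count overlapping character n-grams, a cheap standard stylometric feature."""
    text = " ".join(text.lower().split())  # normalize case and whitespace
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine_similarity(a, b):
    """Cosine similarity between two n-gram count vectors, in [0, 1]."""
    dot = sum(a[g] * b[g] for g in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def same_author_score(candidate, corpus_samples):
    """Average similarity of a candidate essay against known writing samples."""
    cand = char_ngrams(candidate)
    return sum(cosine_similarity(cand, char_ngrams(s)) for s in corpus_samples) / len(corpus_samples)
```

Whether the Claude essay and my own score within noise of each other against the (pre-contamination) corpus would be one semi-objective check to put alongside human judges.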

(We'd need to decide if I'm allowed to use LLMs myself, in the same manner I already do. Claude wouldn't get the benefit of me giving it feedback or editing its output in any capacity)

Access to all of your public past writing

This would be an absolute pain to collate, both for me and for Claude. Then again, it might be easier for it, right until it ran into the fact that even the largest context windows would choke. I write a lot, and have for many years. Unfortunately, I'm not famous enough to be preferentially scraped; LLMs do not know who I am without looking it up. That's a luxury reserved for Gwern or Scott.
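A back-of-the-envelope sketch of the choking problem, using the common rough heuristic of ~4 characters per English token (the context limit below is an assumed round number, not a quoted spec):

```python
def estimate_tokens(text, chars_per_token=4.0):
    """Rough token estimate: ~4 characters/token is a common rule of thumb for English."""
    return int(len(text) / chars_per_token)

def fits_in_context(posts, context_limit=200_000):
    """Would the whole scraped corpus fit in one context window?"""
    return sum(estimate_tokens(p) for p in posts) <= context_limit

def chunk_corpus(posts, budget=150_000):
    """Greedily pack posts into batches that each stay under a token budget,
    so the corpus can be fed to the model in passes instead of all at once."""
    batches, current, used = [], [], 0
    for post in posts:
        cost = estimate_tokens(post)
        if current and used + cost > budget:
            batches.append(current)
            current, used = [], 0
        current.append(post)
        used += cost
    if current:
        batches.append(current)
    return batches
```

Years of long-form posting plausibly runs into the millions of tokens, so any agent would have to work from chunks or summaries rather than the raw corpus.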

Access to base models
Access to fine-tune base or instruct models
Access to a vast.ai box with an H100 for 24 hours to do whatever else it wants

Would you be willing to pay for that or provide access? I wouldn't want to financially burden poor SF more than he already is; the Claude Max plan is a sunk cost, while these clearly represent additional investment. Genuine question, I'm sure that Claude could usefully interface with everything if provided access.

Just after I started writing this, the Metal Gear ! sound effect played in my head. Hang on a minute, I don't think I'd describe you as an LLM skeptic according to my (incomplete and rudimentary) taxonomy. I'd struggle to pin you as a moderate either, though I can't recall you expressing strong opinions that aren't more research/alignment oriented. If my hunch is true, then you aren't the target audience for this! Plus I'll eat my hat if you don't already have access to the best models out there. Do you really need us for the test, beyond whatever blinding or scraping of my writing is necessary?

Also, this is slightly off-topic. We're not evaluating the agents on their ability to write code, we're testing their ability to mimic me convincingly. Sure, coding capabilities might come into the picture, especially if they're in charge of other models, but it's not as central.

This doesn't strike me as insurmountable. I'd be open to trying once you get back to me on the issues I raised.

I do not see how you can interpret us in that manner.

We are hoping to change minds. This is disclosed upfront. We're also open to changing our own minds. We would also be genuinely interested to discover cases where we expect success and get failure - mapping the capability frontier is valuable independent of which direction the evidence points.

If the problem is deemed too hard by everyone (the person proposing it clearly believes the model can't do it), then what exactly does failure demonstrate? Nobody ever expected it to succeed within the given constraints. You can't evaluate automobiles in terms of their ability to reach Alpha Centauri. You can't adjudicate a debate between a Ferrari fanatic and a Lambo lover based on which car is more effective at deep sea exploration.

It takes disagreement on model capabilities and (expected) outcomes for all of this to be surprising or useful.

As we've clearly stated later, if we agree to the challenge, then we expect that the model can do something (that our counterparty thinks it can't), so the failure of the model goes against us, and will force us to update.

I'll forward the proposals to @strappingfrequent, assuming he doesn't show up in the thread. They seem reasonable enough to me, but I am clearly not the real expert here, and I'll be deferring to his judgment. That might take a little while to organize, I'll edit this into the main post for the sake of clarity.

See For Yourself: A Live Demo of LLM capabilities

As someone concerned with AI Safety and the implications of cognitive automation for human employability since well before it was cool, I must admit a sense of vindication from seeing AI dominate online discourse, including on the Motte.

We have a wide range of views on LLM capabilities (at present) as well as on their future trajectory. Opinions are heterogeneous enough that any attempt at taxonomy will fail to capture individual nuance, but as I see it:

  • LLM Bulls: Current SOTA LLMs are capable of replacing a sizeable portion of human knowledge work. Near-future models or future architectures promise AGI, then ASI in short order. The world won't know what hit it.

  • LLM Moderates: Current SOTA models are useful, but incapable of replacing even mid-level devs without negative repercussions on work quality or code performance/viability. They do not fully substitute for the labor of the average professional programmer in the West. This may or may not change in the near future. AGI is uncertain, ASI is less likely.

  • LLM Skeptics: Current SOTA models are grossly overhyped. They are incompetent at the majority of programming tasks and shouldn't be used for anything more than boilerplate, if that. AGI is unlikely in the near-term, ASI is a pipedream.

  • Gary Marcus, the dearly departed Hlynka. Opinions not worth discussing.

Then there's the question of whether LLMs or recognizable derivatives are capable of becoming AGI/ASI, or if we need to make significant discoveries in terms of new architectures and/or training pipelines (new paradigms). Fortunately, that isn't relevant right now.


Alternatively, according to Claude:

The Displacement Imminent camp thinks current models already threaten mid-level knowledge work, and the curve is steep enough that AGI is a near-term planning assumption, not a thought experiment.

The Instrumental Optimist thinks current models are genuinely useful in a supervised workflow, trajectory is positive but uncertain, AGI is possible but not imminent. This is probably the modal position among working engineers who actually use these tools.

The Tool Not Agent camp thinks current models are genuinely useful as sophisticated autocomplete or search, but the "agent" framing is mostly hype — they fail badly without tight human scaffolding, and trajectory is uncertain enough that AGI is not worth pricing in.

The Stochastic Parrot camp (your skeptics, minus the pejorative) thinks the capabilities are brittle, benchmark gaming is rampant, and real-world coding performance is far below reported evals. They're often specifically focused on the unsupervised case and the question of whether the outputs are actually understood vs. pattern-matched.

The dimension you might also want to add explicitly is who bears the cost of the failure modes — because a lot of the disagreement between practitioners isn't about raw capability but about whether the errors are cheap (easily caught, low stakes) or expensive (subtle, compounding, hard to audit). Someone who works on safety-critical systems has a very different prior than someone shipping web apps.


Coding ability is more of a vector than it is a scalar. Using a breakdown helpfully provided by ChatGPT 5.2 Thinking:


Most arguments are really about which of these capabilities you think models have:

  1. Local code generation (Boilerplate, idioms, small functions, straightforward CRUD, framework glue.)

  2. Code understanding in situ (Reading unfamiliar code, tracing control flow, handling large repos, respecting existing patterns.)

  3. Debugging and diagnosis (Finding root cause, interpreting logs, stepping through runtime behavior, reproducing bugs.)

  4. Refactoring and maintenance (Changing code without breaking invariants, reducing complexity, untangling legacy.)

  5. System design and requirements translation (Turning vague specs into robust design, choosing tradeoffs, anticipating failure modes.)

  6. Operational competence (Tests, CI, tooling, dependency management, security posture, deploy and rollback, observability.)

Two people can both say “LLMs are great at coding” and mean (1) only vs (1)+(2)+(6) vs “end-to-end ticket closure.”


With terminology hopefully clarified, I come to the actual proposal:

@strappingfrequent (one of the many Mottizens I am reasonably well-acquainted with off-platform), has very generously offered:

  1. A sizeable amount of tokens from his very expensive Claude Max plan ($200 a month!) and access to the latest Claude Opus.

  2. His experience using agent frameworks and orchestration. I can personally attest that he was doing this well before it was cool, I recall seeing detailed experimentation as early as GPT-4.

  3. His time in personally setting up experiments/tests, as well as overseeing their progress, while potentially interacting with an audience over a livestream.

He works as a professional programmer, and has told me that he has been consistently impressed by the capabilities of AI coding agents. They've served his needs well.

Here's his description of his skills and experience:

in my professional capacity, I've been working with Python for back-end (computer vision algorithms, FastAPI, Django) & Java (Spring). For Front-end; React. 95 percent of what I do is boilerplate, although Sonnet 3.5 did help me solve a novel problem last year but it did take quite a bit of back & forth -- the key was discussing what additional metrics I could capture to help nail down ~30+ parameters influencing a complicated computer vision pipeline.

tldr; the more represented your use case is in the training corpus, the better the results (probably) -- but I am absolutely confident that Opus 4.6 can help with novel problems, too. And, y'know -- Terence Tao thinks that as well.

To what end?

He and I share a dissatisfaction with AI discourse that substitutes confident assertion for empirical investigation, and we think the most useful contribution we can make is to show the tools actually working on tasks that skeptics consider beyond their reach.

What do we want from you?

If you self-identify as someone who is either on the fence about LLMs, or strongly skeptical that they're useful for anything: share a coding challenge that you think they're presently incapable of doing, or doing well.

An ideal candidate is a proposal that you think is beyond the abilities of any LLM, while not being so difficult that we think it'd be entirely intractable. Neither of us claims that we can solve Fermat's Last Theorem (or that Claude can solve it for us).

Other requirements:

  • A clear problem specification, or a willingness to submit a vaguer one and then approve a tighter version as created by us/Claude.

  • Nothing so easy/trivial that a quick Google shows that someone's already done it. If you want a C++ compiler written by an LLM, well, there's one out there (though that is the opposite of trivial).

  • Nothing too hard. He provides an example of "coding a Netflix clone in 4 hours".

  • An agreement on the degree of human intervention allowed. Can we prompt the model if it gets stuck? Help it in other ways? Do you want to add something to the scope later? (Strongly inadvisable). Note that if you expect literally zero human intervention, SF isn't game. He says: "I don't think I'd care to demonstrate any sort of zero-shot capacity... that's a silly expectation. If my prime orchestrator spends 30 minutes building a full-stack webapp that doesn't work I'll say It doesn't work; troubleshoot, please. I trust your judgement."

  • A time-horizon. Even a Max plan has its limits, we can't be expected to start a task that'll take days to complete.

  • Some kind of semi-objective rubric for grading the outcome, if it isn't immediately obvious. Is it enough to succeed at all? Or do you want code that even Torvalds can't critique? And no, "I know it when I see it" isn't really good enough, for obvious reasons. Ideally, give us an idea of the tests everything needs to pass.

  • If your task requires the model to review/extend proprietary code, that's not off the table entirely, but it's up to you to make sure we can access it. Either send us a copy or point us at a repo.

  • Nothing illegal.

But to sum up, we want a task that we agree is probably feasible for an LLM, and where success will change your mind significantly. By which I mean: "If it succeeds at X, I will revise my estimate of LLM capability from Y to Z." Otherwise I can only imagine a lot of post-hoc goalpost movement, "okay but it still needed 3 prompts" or "the code works but a good programmer would have done it differently."
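The "from Y to Z" demand is really just asking skeptics to pre-register a Bayesian update. A toy sketch of the arithmetic, with every probability below invented purely for illustration:

```python
def posterior(prior, p_success_if_capable, p_success_if_not, succeeded):
    """Bayes' rule for P(model is capable | observed outcome of the challenge)."""
    if succeeded:
        num = prior * p_success_if_capable
        den = num + (1 - prior) * p_success_if_not
    else:
        num = prior * (1 - p_success_if_capable)
        den = num + (1 - prior) * (1 - p_success_if_not)
    return num / den

# A hypothetical skeptic at 20% who thinks a genuinely capable model
# passes the agreed task 90% of the time, while an incapable one
# flukes through only 10% of the time:
after_success = posterior(0.20, 0.90, 0.10, succeeded=True)
after_failure = posterior(0.20, 0.90, 0.10, succeeded=False)
```

Writing the numbers down in advance is the point: afterwards there's no room to claim the result "doesn't count".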

We reserve the right to choose which proposals we attempt, partly because some will be more interesting than others, and partly because we have finite tokens and finite time.


Miscellaneous concerns:

Why Claude Opus 4.6?

Well, the most honest answer is that @strappingfrequent already has a Claude Max plan, and is familiar with its capabilities. The other SOTA competitors include Gemini 3.1 Pro and GPT 5.3 Codex, which are nominally superior on a few benchmarks, but a very large fraction of programmers insist that Claude is still the best for general programming use cases. We don't think this choice matters that much, and the models are in fact roughly interchangeable while being noticeably superior to anything released before very late 2025.

Why bother at all?

We are hoping to change minds. This is disclosed upfront. We're also open to changing our own minds. We would also be genuinely interested to discover cases where we expect success and get failure - mapping the capability frontier is valuable independent of which direction the evidence points. We will share logs and a final repo either way. An interactive livestream is possible if there is sufficient interest.

Anything else?

You know the model. We'll be using Claude Code. The specifics of the demo are TBD, it could be a livestream with user engagement if there's sufficient interest, otherwise we can dump logs and share a final repo.

The floor is open. What do you think Claude cannot do?


Edit:

I'll forward the proposals to @strappingfrequent, assuming he doesn't show up in the thread himself. I am clearly not the real expert here, plus he's doing all of the heavy lifting. Expect at least a day or two of back and forth before we come to a consensus (but if he agrees to something, then assume I won't object, but not vice versa unless specifically noted), and that includes conversations within this thread to narrow down the scope and make things as rigorous as possible while adhering to the restrictions we've mentioned. At some point, we'll announce winners and get to scheduling.

Very few, but still non-zero. Classic examples would be Ender's Game; then we've got HPMOR and other rat-fic.

Exactly, if a woman marries me for my money, extends me love and attention, raises my kids, watches me die of natural causes and then goes to the Bahamas to cry on a cruise ship, I'm not really seeing the issue here.

There are very few women who don't care about money at all. I ask the married male Mottizens here to consider what would happen if they suddenly gave away all their money, quit their jobs and then told their wives that. "But don't you love me for who I am?", you'll have to cry plaintively as she files papers and takes the kids.

She's never been permabanned. I seem to recall her saying she'd lost the password to her previous account, and she then turned down our offer to restore it.

Thank you, that's the one. My internal betting market had strong odds in favor of you being the first to find the link, good to see I'm well-calibrated.

doesn't quite match your assertion

Hmm. It seems I was misremembering. I will weaken my claim, since 18 (or my speculated 16) being peak female attractiveness isn't supported by the graph.

I will note:

as you can see, men tend to focus on the youngest women in their already skewed preference pool, and, what's more, they spend a significant amount of energy pursuing women even younger than their stated minimum. No matter what he's telling himself on his settings page, a 30 year-old man spends as much time messaging 18 and 19 year-olds as he does women his own age. On the other hand, women only a few years older are largely neglected.

I think this supports part of my argument: namely, that by setting an age minimum at 18, OKCupid obscures the fact that many/most men would happily approach younger women if they had the option. I suppose this is even less controversial, women don't magically go from being divorced of sexual value at 18 years - 1 Planck time to being hot when the clock strikes 12 on their 18th birthday.

Also look at the charts titled "The shape of the dating pool" and "how a person's attractiveness changes with time":

The latter shows that 18 year old women are about 75% as attractive as they are at their absolute peak at 21. They are roughly twice as attractive as they would be at 34. This strongly implies that women below 18 are more attractive than the majority of older women, the range restriction just doesn't allow us to measure this.

I had an ex who was actually two years older than me, but could have passed as 18 without much hassle. I visited London with her when I was 26ish, and she was 28. I remember getting dirty looks at a liquor store with her on my arm as we were gawking at the variety of booze on offer. The next time, when she went alone, she got even dirtier looks, and was finally accosted by both a random old granny and the lady at the till on suspicion of underage drinking. It was funny in hindsight, as much as women complain about getting carded, they're even more upset when it stops.

On the other hand, excluding venues where they have a policy of carding anyone who walks in, I haven't been specifically asked for ID since I was 16. I can only presume that we were giving off the impression of a sizeable age gap.

Anecdotes aside, I think the primary driver of age gap discourse is the bitterness of a specific age group of women engaged in intrasexual warfare that spills out into intersexual forms.

Ages 25-35, I'd say. Just young enough to be terminally online, unlike even older women who grew up and settled down before this was capital-d Discourse. (There are very few grannies out there who are going to lecture their granddaughters about dating a 35 yo when they're 22.)

They notice that the youth they once prized is fading, and while they're still perfectly happy to go for older men (as are almost all women), they resent the fact that the men in their ideal age range don't consider them to be in their ideal age range.

Lip-service to feminism makes it difficult to directly attack their direct competitors (younger girls), without coming off as bitter and butt-hurt. But you can attack the men. And if you can successfully pathologize male preference for youth as predatory, you accomplish two things simultaneously: you make the competing demographic seem like victims who need protection rather than rivals, and you make the men who prefer them seem like villains.

This reframing has the additional advantage of being unfalsifiable in ways that make it rhetorically robust. Any counterexample, any young woman who says she's perfectly happy in her relationship and was not victimized, can be explained as evidence of how thorough the manipulation was. She doesn't know she's a victim. That's the worst part.

The frontal-lobe argument is where things get especially interesting. The claim is that the prefrontal cortex isn't fully developed until 25, therefore people under 25 lack sufficient judgment to consent to relationships with older partners. I've seen this argument made by people with actual MDs on /r/medicine, which I find both impressive and alarming. It's impressive because it successfully launders a social preference into neuroscience. It's alarming because it's bad neuroscience.

Neurodevelopment is continuous. The "fully developed at 25" framing suggests a step function where below 25 you're basically a golden retriever and above 25 you're suddenly Immanuel Kant. This is not how brains work. The research shows gradual changes in certain cognitive and regulatory processes, with enormous individual variation, and basically no evidence that this translates into systematic inability to make reasonable decisions about relationships.

The younger girls? They absorb this by cultural osmosis. Younger Gen Z is actually the most vocal about age-gap discourse. Unfortunately (or fortunately), that isn't enough to overcome their innate biological preference for older, successful men, so actual behavior doesn't change much. If a 20 year old girl meets a 30 year old man she thinks is cute, she'll usually have few qualms about sleeping with him or getting into a relationship, age-gaps be damned.

Power-disparity is bad? Huh, someone should tell all the women who prefer that kind of disparity, in favor of the men they desire. Men tend to be more focused on attributes such as physical attractiveness and youth, which are, no prizes for guessing, more common in younger women.

I find such pathologization of universal human preferences distasteful, doubly so when my field is molested and forcefully conscripted to shore up bad arguments. Oh well, so be it. I'm lucky enough to be a MILF enjoyer and thus immune from direct blowback for the most part, even if I regretfully note that "MILF" increasingly just means women my age.

(Another anecdote: I remember grinding on a girl I vaguely knew at a club in Scotland. An older friend of mine had a thing for a bisexual woman about the same age as me. She ended up chatting with the first girl, who seemed receptive to her advances. Then the girl disclosed that she was 19, and that made the woman freak out, as they later explained in our company. I put aside any plans to approach the girl later, since the headache was far from worth it.)

If I was less lazy/busy, I'd insert the usual OkCupid stats blogs/archives from before they were bought and cucked. They showed that female attractiveness peaked at 18, but that was their minimum age cutoff, so I suspect the actual figure is even lower at around 16. Men also showed tolerance to wider age gaps as they got older. 30 year old and 35 year old men showed roughly the same willingness to approach 25 year old women.

I believe Gwern has a copy. Someone please do this in the comments, thanks, :*

Just look at what the side bar on the blog is titled.

I think my actual favorite by Watts is the Sunflower series/novella. There's no scope for heavy handed ecological metaphors, just good old fashioned scifi and existential dread.

That's like saying Einstein and a village idiot both suffer from the "same" problem, they stub their toes at equal rates. Or saying that a drunk Asian grandma and a professional F1 driver are as incompetent because F1 drivers crash their cars too.

How often they fail is important.

Now that's actual insanity. I presume you mean you used GPT 3.5 (because that was the version in the first public ChatGPT release) vs GPT-4.

The actual GPT-3 was a base model, it wasn't instruction tuned.

I actively used GPT 3.5 when I was learning how to code, and found it useful but frustratingly inaccurate. I remember trying GPT-4 during the same period, and it was so much better that I gave up all aspirations of directly switching from medicine to programming and ended up becoming a psychiatrist. Regardless of how good the AI was, I noticed that it was getting better, faster than I was. An excellent choice in hindsight.

If that is your serious opinion, then that is a genuine reason to discount anything you have to say about LLMs. You didn't even need benchmarks, it was as obvious as the performance difference between a rickety tuktuk and a Honda Civic.

Also reading Legend of the Galactic Heroes again.

I tried watching the anime, after seeing it shared as an example of a "rational"(ish) anime.

The first episode (all that I bothered watching) disappointed me greatly. The so-called strategic genius won a fleet battle against all odds by using tactics obvious to a particularly bright seven year old. Someone tell me if it's worth persisting despite poor first impressions.

Sure, if we're being strict about things. But then there's everything else Watts says, which makes me feel justified in saying that was his subtext/implication. He comes out and says so!

I'm probably misremembering. I think I've read the book at least 5 times, but the last time was probably over a year ago.

The point still stands: we have limited insight into the actual degree of consciousness in a sleepwalking state. It's clearly abnormal, but our understanding of neuroscience can't confidently rule consciousness out: since the ability to form long-term memories is largely disabled, any consciousness present couldn't be reported by the sleepwalker later (the same reason you start forgetting a dream as soon as you wake up).

If you've ever lucid dreamed (I haven't, sadly) then that demonstrates the ability to be aware and at least partially conscious during REM sleep. Sleepwalking is NREM behavior, sure, but it's not possible to say that the sleepwalker is entirely unconscious, we just don't know.

Even if they're performing complex motor behaviors, I strongly suspect that overall performance is hampered. They might (in rare cases), drive a car, but I doubt they drive as well as they would fully awake. I could be wrong, but without the ability to subject an active sleepwalker to a battery of cognitive tests, I'll stay here. It's a very tricky subject to study.

Eh, I have mixed feelings on the topic. Watts did his best to rationalize the concept with evobio, but that only gets you so far with vampires. It's kinda cool, but they're far from plausible organisms.

Oooh they're scary dangerous predators that would murderise us all if they could. Yeah, and so could great white sharks, with their dead shoe-button eyes.

Unlike sharks, vampires are depicted as both amoral/murderous, and more intelligent than us silly humans.

We're not going to be murdered by sharks any time soon, and the sentimentality around the way some people treat them accords perfectly well with the stupidity of, as you point out, letting the vampires walk around unfettered. I can easily believe some people would be greedy and stupid enough to think they could make pets out of vampires and use them for PROFIT. But the vampires themselves? There's nothing there, they're just automata. Or sharks, perfect killing machines but no higher goal than that.

The thing is, they don't roam around entirely unfettered! In-universe, they're recognized as highly dangerous, and mitigation measures are put in place:

  • The original vampires were highly territorial hypercarnivores who couldn't stand competition. The resurrected ones had those tendencies ramped up; they were described as murdering each other if allowed into close proximity. Think shoving two male tigers into the same enclosure.

  • Their handlers thought that this instinctual intolerance of their own kind would prevent scheming and conniving. They were very, very wrong. The way the vampires coordinated their rebellion is excellent, probably one of the best depictions of the power of decision theory for modeling and coordination. They just imagined what they'd do in the place of another vampire, and vice versa, solved for the equilibrium, and acted, independently and simultaneously, without ever having to actively exchange information with their kin. Hats off.

  • The crucifix glitch was weaponized against them: the belief was that if they went off the reservation, they'd die painfully as soon as the drugs suppressing their lethal seizures wore off.

The humans weren't entirely complacent, but they were still unforgivably insufficiently paranoid about creatures smarter than them, which they knew to be hostile by default. The vampires consistently used their superior physical prowess to murder normal humans, not just their brains.

So why even let them have that physical prowess? It doesn't take a genius to say "hey, maybe we should give them the grip strength of an obese 4channer". The vampires were kept around for their brains, not their brawn; the latter added nothing while making them a greater threat. This is, as far as I'm concerned, handing the humans an idiot ball. The ways the vampires circumvented their other shackles are understandably hard to predict without the benefit of hindsight. Tearing people apart with their bare hands isn't.

You know what? I don't think he is engaging with the article. The article specifically mentions GPT 5.2 Pro seven times, two of which seem, to my read, to imply that that's what he's using. There is one moment where he just says "GPT 5 Pro". Perhaps he just happened to leave off the ".X" in this one spot. Perhaps I'm reading the other seven mentions of GPT 5.2 Pro wrong, and the dirty secret is that he's using 5.0. I suppose he doesn't say in big bold highlighted words, "I'm definitely using 5.2 and not 5.0," so sure, maybe one could say that it would be nice to have a clear statement.

I checked, and this seems correct.

On that basis, I can't really disagree with your claim that @Poug didn't engage with the article. Being charitable, it's exceedingly common to see this happen in the wild, so he might have jumped to conclusions; but neither you nor the author seems to have made that kind of error, and it's unfair to criticize you on those grounds.

Sure Rorschach is more advanced than humanity, but that obviously doesn't prove that consciousness is a drag any more than someone taller and balder than you indicates that hair is keeping you short.

Rorschach is explicitly described as a p-zombie/Chinese Room, and is used as an existence proof for superintelligence without qualia or consciousness. I struggle to separate in-universe speculation from author fiat, but I doubt Watts is the kind to devote that much screentime to an idea without at least partially endorsing it.

It's the most technologically advanced entity in Sol, it's doing very well for itself, and all without being conscious. I think that constitutes a claim that consciousness isn't particularly important.

Anyway, after writing this, I had GPT 5.2 Thinking check the version hosted on Archive for direct quotes:


From Siri’s internal monologue near the end (the book’s most on-the-nose anti-sentience passage):

“It begins to model the very process of modeling. It consumes ever-more computational resources, bogs itself down with endless recursion…”

“Metaprocesses bloom like cancer, and awaken, and call themselves I.”

“The system weakens, slows… advanced self-awareness is an unaffordable indulgence.”

“This is what intelligence can do, unhampered by self-awareness.”

That last line is basically your exact request in one sentence.

From the Notes and References, in the back-matter discussion of consciousness as interference and nonconscious competence (Watts stepping partly out of “story voice”):

“Consciousness does little beyond taking memos… rubber-stamping them, and taking the credit for itself.”

“The nonconscious mind… employs a gatekeeper… to do nothing but prevent the conscious self from interfering…”

“It feels good… makes life worth living. But it also turns us inward and distracts us.”

“While… people have pointed out the various costs and drawbacks of sentience, few… wonder… if… it isn’t more trouble than it’s worth.”


It also found a full interview where Watts, out of universe, says:

It finally occurred to me that if consciousness actually served no useful function – if it was a side-effect with no adaptive value, maybe even maladaptive – why, that would be a way scarier punch-in-the-gut than any actual function I could come up with. It would be an awesome narrative punchline for a science fiction story. So I put it in.

Of course, not being any kind of neuroscientist, I had no doubt that I’d missed something really obvious, and that if I was lucky a real neuroscientist would send me an email setting me straight. At least I would have learned something. It never occurred to me that real neuroscientists would start arguing about whether consciousness is good for anything. In hindsight, I seem to have just blindly tossed a dart over my shoulder and hit the bullseye entirely by accident.

https://milk-magazine.co.uk/interview-peter-watts-sci-fi-novel-blindsight/

https://x.com/lauriewired/status/2020006982598685009?s=20

This is the closest I've ever come to seeing usage in the wild, and Laurie claims it's applied by some flavor of analyst. I suppose it's neat?

Well, I don't see myself crossing the bright line of actually posting my essay here and then begging for votes. I think simply soliciting suggestions and mentioning a rather extensive list of potential candidates I've come up with is probably fine. I don't think @ScottA would mind.

So you may want to avoid stating what your final decision on this topic is.

Fair enough, but I'm still in the concepts-of-a-plan stage.

Did you know that visualizing data in the form of faces is an actual technique?

https://en.wikipedia.org/wiki/Chernoff_face

Making them screaming faces? Subtlety is a lost art.
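The core trick behind Chernoff faces can be sketched in a few lines: each normalized data dimension drives one facial feature. Here's a minimal illustrative mapping (my own construction, not any standard library API — the feature assignments are arbitrary):

```python
def chernoff_features(row):
    """Map a data row (values normalized to [0, 1]) onto face parameters.

    Which dimension drives which feature is an arbitrary illustrative
    choice, which is also one standard criticism of the technique.
    """
    # Clamp defensively, then pad short rows with a neutral 0.5
    v = [min(max(x, 0.0), 1.0) for x in row]
    v += [0.5] * (4 - len(v))
    return {
        "face_width":  0.5 + 0.5 * v[0],    # dim 0 -> head width
        "eye_size":    0.05 + 0.15 * v[1],  # dim 1 -> eye size
        "mouth_curve": 2.0 * v[2] - 1.0,    # dim 2 -> mouth curvature in [-1, 1]
        "brow_slant":  v[3] - 0.5,          # dim 3 -> eyebrow slant
    }

print(chernoff_features([1.0, 0.0, 0.5, 0.5]))
```

These parameters could then be fed into a drawing library (e.g. matplotlib patches) to render the actual faces, screaming or otherwise.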

I do not think it's fair to say that @Poug didn't engage with your post.

If you say:

It seems to me to be a balanced take. He's bullish and hopeful on the future, while trying to be accurate/realistic about current capabilities, while remaining somewhat concerned about possible problems

Then it is entirely fair to point out that the person you're using as an authority isn't using cutting-edge models that correctly capture "current capabilities". A few months is a very long time indeed when it comes to LLMs.

That is all I have to say, and I mean it. I'm not a professional mathematician, so I can't attest to their peak capabilities as a primary source. The last time I could was when I got my younger cousin (then a Masters student, now a postgrad at one of the more prestigious institutions here) to examine their capabilities in my presence.

"Is the one-point compactification of a Hausdorff space itself Hausdorff?" was a problem I could actually understand, after he showed me the correct answer. The LLMs of the time were almost always wrong; six months later we got mixed results; but by a year ago they got it right every time (restricting ourselves to reasoning models, and you shouldn't use anything else for maths).
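(For the curious, the standard answer is "not necessarily" — Hausdorffness alone isn't enough:)

```latex
% One-point (Alexandroff) compactification: X^* = X \cup \{\infty\}.
% Standard result:
X^* \text{ is Hausdorff} \iff X \text{ is Hausdorff and locally compact.}
% Counterexample: \mathbb{Q} is Hausdorff but not locally compact
% (no rational has a compact neighbourhood), so \infty cannot be
% separated from the points of \mathbb{Q} in \mathbb{Q}^*.
```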

Now? He went from being skeptical about my claims of near-term AI parity in mathematics to what I can only describe as grim resignation.

("Now" being six months ago, the last time I saw him.)


In the interest of fairness, I think @Poug is probably incorrect when he says:

But you will also notice the absense of issues you are facing

I'm not saying this with confidence, because that's just my recollection of what actual mathematicians say these days, including Tao himself. I just mention it to hopefully demonstrate that I'm trying very hard not to be a partisan about things.

It's excellent to see you living up to the latter half of your username. Here, have a cookie for good behavior.