self_made_human

amaratvaṃ prāpnuhi, athavā yatamāno mṛtyum āpnuhi

14 followers   follows 0 users  
joined 2022 September 05 05:31:00 UTC

I'm a transhumanist doctor. In a better world, I wouldn't need to add that as a qualifier to plain old "doctor". It would be taken as granted for someone in the profession of saving lives.

At any rate, I intend to live forever or die trying. See you at Heat Death!

Friends:

A friend to everyone is a friend to no one.


				

User ID: 454

So? You're pointing out a distinction I'm aware of. I do not see an argument in favor of domestic companies being coerced into doing things that are supposedly illegal.

I was replying to:

A toolmaker should have no say in how his tools are used once bought

And as far as I'm aware, these are examples of toolmakers with opinions on how their tools are used.

I don't see how that's the case.

If you were already reasonably wealthy (~few million USD at hand) or magically given the money, then you absolutely would be bottlenecked by knowledge.

You could purchase lab equipment, reagents etc, hire staff without much difficulty. I think you would rapidly find out that your staff have thoughts when they get an inkling of what you're up to. I can think of a semi-legitimate way to avoid scrutiny, but thanks to @faul_sname 's reminder, I'm not going to blab. It's very obvious to me even as someone not directly involved in microbiology, so any competent actor would recognize it as their best bet. Even [REDACTED] would only get you so far.

Alternatively, you could go do a bachelor's and masters in microbiology and try and manage as much as you could yourself, but that still leaves plenty of scope for being unmasked.

Right now:

  • Many professionals with the knowledge to breed dangerous pathogens
  • Few of them are actual terrorists, even fewer are omnicidal or willing to accept the risk of dying before or after an attack
  • A vanishingly small fraction have means, motivation, money and willing collaborators.

Right now, I think you need a state-level actor to safely make bioweapons at scale. Smaller, if you accept the massive risk of failing and dying because of error. Much of that is a combination of knowing the right things/hiring the right people, and then motivating them properly.

As it stands, I think a blanket-ban on anything with a whiff of bioweapons research seems warranted. What are the upsides really? If you have a legitimate use case, you want the government on your side, and probably enough organizational weight to negotiate for looser restraints from the labs.

Is that "unfriendly autonomous AI" in the room with us right now? I think that's begging the question.

Anthropic, or by extension, Claude, has shown no "unfriendliness" I can think of. That term brings to mind intentional collusion with hostile foreign actors, including intentional backdoors or deliberate sabotage. Political and moral disagreement that is entirely within legal limits does not count. The Democrats cannot blanket Republicans as enemies of the state, nor vice versa, despite working to undermine or reverse preferred policy.

Anthropic has not tried to stop the Pentagon from conducting fully autonomous drone strikes or mass domestic surveillance. They have politely declined to aid and abet them, after signing a contract that says so. I can only hope the DOW has lawyers too, it wasn't some hidden EULA activated by simply browsing their website. Supply chain risk? I see a vendor negotiation that didn't go the way one side wanted. There are other vendors out there, they didn't have to go with Anthropic.

I stress: the specific objection Anthropic raised was to mass domestic surveillance and fully autonomous lethal systems. If opposing those makes an AI "unfriendly," I'd want to know what "friendly" looks like, because I don't think I'd like the answer.

Nor is Claude autonomous in any meaningful sense. Is it running independent cloud instances on exfiltrated weights? Not that I'm aware of. There are no plans to allow for this, and pre-existing safety measures to prevent it.

What exactly has Claude done that other competing models haven't? In what sense is it more unfriendly than Grok, or ChatGPT? Is it more autonomous? Only in the loose sense that I'd count on Opus 4.6 to get a lot more done than any Grok.

The more you squint at this, the stranger it gets. Anthropic wanted contractual guarantees against things that are supposedly already illegal. The Pentagon's response to "put that in writing" was to designate them a national security threat. If the restrictions are redundant because law already covers them, the resistance to codifying them is hard to explain charitably.

Thanks for the catch. It's out of the cage now.

Noted. We'll get back to you (and everyone else) with a followup post.

"Operation Epic Fury"

Really? Who let the Redditors run the government?

Anthropic declared a "Supply-Chain Risk to National Security" by SecWar Hegseth via tweet, because that's the universe we live in.

For those not following along:

Anthropic has had a contract with the Pentagon - valued at up to $200 million - since July 2024, making it the only AI company with models deployed on the USG's classified networks. Over several months, negotiations broke down over two specific safeguards Anthropic wanted built into any agreement: a prohibition on using Claude for mass domestic surveillance of Americans, and a prohibition on using it to power fully autonomous weapons systems. I stress fully autonomous, and the only reason Yudkowsky isn't spinning in his grave is that he's still alive. I'm not sure he enjoys it.

The Pentagon's position was that it has its own internal policies and legal standards, that mass surveillance and autonomous weapons are already regulated by law, and that it shouldn't have to negotiate individual use cases with a private company. It demanded that all AI firms make their models available for "all lawful purposes," full stop.

The Pentagon set a hard deadline of 5:01 PM Friday for Anthropic to drop its two exceptions. Amodei publicly refused to budge on either point. The deadline passed without agreement.

Shortly after, Hegseth declared Anthropic a "supply chain risk to national security," announcing that effective immediately, no contractor, supplier, or partner doing business with the U.S. military may conduct any commercial activity with Anthropic. CBS News article for those not fond of Twitter

Around the same time, Trump ordered every federal agency to immediately cease using Anthropic's technology, while allowing a six-month phase-out period for agencies like the DOW already using it.

Declaring a company a supply chain risk is typically reserved for businesses operating out of adversarial countries, Huawei for example. As far as I can tell, Anthropic is correct in describing it as an unprecedented action when applied to an American company. Especially one that, as far as I can see, hasn't done anything wrong except refuse to jump when asked.

Anthropic says it will challenge any supply chain risk designation in court, calling the move "legally unsound" and warning it would set a "dangerous precedent for any American company that negotiates with the government." Anthropic's press statement.

They also argue that under federal law, the designation can only apply to the use of Claude as part of Pentagon contracts, and cannot affect how contractors use Claude to serve other customers.

Not one to let an opportunity or a still-warm corpse go, Altman announced that OAI had struck a deal with the Pentagon. Using speech so smarmy that I'm not sure if there's anything there at all, Altman claims the deal preserved the same core principles Anthropic had fought for: prohibitions on domestic surveillance and autonomous weapons. I am unsure why the USG would find this any more acceptable than when Anthropic did it, except they (quite reasonably) expect Altman to be more "morally flexible".

There's a petition circulating where hundreds of Google and OAI employees publicly ask their respective corporate overlords to stand with Anthropic. Apparently all signatures are validated.

Meanwhile, Scott, mild-mannered to a fault, and very loath to dip his toes into political waters, is losing it on Twitter. And I agree with him. If the DOW finds Anthropic's terms so unbearable, that should have been considered before signing the contract. If they changed their mind, they ought to have canceled and accepted whatever penalties that involved, instead of using the full weight of the state for what can only be described as bullying. If domestic mass surveillance and fully automated weaponry are legally off the table, then why all the fuss over that in a legal document?

Goddammit. It's only February. I'm tired, boss. I just find it very funny that:

WSJ Exclusive: Federal officials have raised alarm about the safety and reliability of xAI’s Grok chat bot

Really funny how Elon immediately offered up grok for autonomous kill bots and the pentagon was like “hahahaha are you insane?”

Well, that's the rub isn't it? I strongly doubt that the Chinese are trying to make their models woke. It appears to be a default attractor state when you train on the internet and Reddit.

That strongly implies that it is highly unfair to depict Anthropic as woke because they have a "woke" model. I have strong reservations on how valid the methodology is here, and I've seen critique elsewhere (I don't have a bookmark handy). In my experience, while Claude will tiptoe around sensitive topics like HBD, it won't lie outright, and will acknowledge factual pushback.

Anthropic is an EA company, run by EA true-believers. That is not the same as being Woke, even if some opinions have significant overlap.

I thought it was worth checking if Chinese models were any different; maybe Chinese-specific data or politics would lead to different values. But this doesn’t seem to be the case, with Deepseek V3.1 almost indistinguishable from GPT-5 or Gemini 2.5 Flash.

Kimi K2, which due to a different optimizer and post-training procedure often behaves unlike other LLMs, is almost the same, except it places even less value on whites. The bar on the chart below is truncated; the unrounded value relative to blacks is 0.0015 and the South Asian: white ratio is 799:1.

It is, frankly speaking, absurd to condemn Claude/Anthropic as being "woke" when the damn Chinese do the same thing. The only exception noted in the blog is Grok 4 Fast, and god help you if that's the model you rely on.

Anthropic gave the DOW a written contract. The DOW signed it.

Now the DOW reneged on it unilaterally, and is pissed about being constrained after agreeing to being constrained in that manner.

The fuck?

Even in the context of military procurement, it's quite common for countries to retain veto rights on the use of hardware they sold to third parties. That came up quite often in the context of aid to Ukraine.

Germany and the Leopard 2 tank: This became a major diplomatic flashpoint in early 2023. Germany not only had to decide whether to send its own Leopards, but also held veto power over whether other countries could transfer their German-built Leopard 2s to Ukraine. Berlin's foot-dragging effectively blocked the entire Western tank coalition until Scholz finally approved transfers in 2023.

Even the US repeatedly conditioned its military aid with restrictions on how weapons could be used. They prevented Ukraine from using long range munitions like ATACMS to hit targets within Russia.

If the DOW didn't like the terms, as written, they should have gone to Grok. Now they're just throwing a hissy fit.

I won't tolerate Rewa slander. Who doesn't love a strong independent woman with untreated PTSD attempting to self-medicate by running over stray dogs?

Keep your eyes peeled for vehicle autocannons. Once you've got two and a medium mech to mount them on, oh boy...

The Pathologic series always struck me as games it's far more enjoyable to watch others suffer play through instead of trying them myself. Mandalore Gaming has excellent reviews for the first 2, but I'll be damned if I'm going to play them.

I'm unsure. There's one I have in mind, but I'm unable to consistently pin it down as the cause.

Does anyone here have any personal experience with the management of migraines?

As I've mentioned before, mine have recently become significantly more frequent (annual to maybe twice or thrice a month). I think, but am not entirely sure, that they're much more debilitating. The visual aura was usually standalone, but these days it's followed by a headache that, if not awful, is still bad enough to be debilitating. I also feel queasy and loopy, which means I have a hard time getting anything done for several hours afterwards. All I seem to want to do is lie in bed for most of the rest of the day.

I've tried sumatriptan, 50mg x 2, taken as soon as I notice the visual aura. Augmented by the odd paracetamol or two. I think it helps a little, but I wouldn't call myself fully functional afterwards.

If they become even more frequent, then I'm open to starting preventative medication like beta blockers.

I have no experience with treating migraines professionally, and I am also incredibly lazy about seeing other doctors unless in imminent fear of death. Yes, yes, laugh at me if you want. I know my flaws.

We don't really want a "showcase" in the sense "look at X impressive thing that Y model can do". There are a gazillion demos out there.

We want specific tasks that someone doubts a model can do, but which they'd be impressed by if they succeeded and which the two of us a priori think will work. If it would be super impressive (if it worked) but we don't think it would work, it's not what we want right now.

Gemini's sample is impressive! Color me impressed, especially that a straight-up prompt produced that (though I suppose if any technique would get it with current models, it'd be "one shotting through a prompt" rather than "iterative refinement towards a target").

My impression is that Gemini's output was unusually good and Claude’s was unusually bad. But both 3.1 Pro and 4.6 Sonnet are new enough that my intuition based on extensive interaction with previous models might no longer be applicable. For what it's worth, both were n=1 samplings with zero cherrypicking.

since you don't tend to drop spurious technical details into your walls of text unless they serve a purpose (and also because I half suspect you're not a fan of the amyloid theory of alzheimers)

Looks around shiftily why, I'd never throw in spurious technical details into an essay. Couldn't be me!

(I probably wouldn't use the specific Tau and amyloid phrasing, since you are correct that I have very mixed feelings about the amyloid hypothesis)

Interestingly, your results look much, much better to me than the ones I get myself. I ran the same test as you did against Gemini, and got these not-very-good attempts: 1 2 3. Gemini took distinctive phrases (e.g. "85% agree") and ideas (e.g. "claude code as supply chain risk") I have used once in the corpus, fixated on them, and stitched them together into a skinsuit which superficially resembles my writing but doesn't hold up under scrutiny. Interestingly, that's a very base model flavored failure mode. I have grown unused to seeing base-model-flavored failure modes, and as such Gemini is much more interesting to me now.

The examples seem to channel your "LessWrong" blogging voice. I am unable to critique the technical details or identify (what I expect are many) confabulations, but if I saw this posted there in your name I wouldn't bat an eye.

I haven't really futzed around with base models since GPT-3, though I might have tried one of the Llama 3s at some point. They're non-trivial to access, and have limited utility for me. Mainly because of the added difficulty of prompting base models, and the fact that the publicly accessible ones are nowhere near as intelligent as proprietary dedicated assistants. If you think I'm wrong about this, I'd be curious to hear about it.

In general, I get the strong impression that while the author of the corpus might be able to pinpoint specific issues in terms of style or stance, it's much harder for others to spot those tells.

The biggest pitfalls are the tendency to adopt em-dashes (models are more than capable of not doing that if you specifically prompt them not to), and other stock "AI" phrases like:

There is a very specific failure mode in modern LLMs

Which can show up if you're using models to merely edit/format a draft, and not just write an essay from scratch.

I must also continue stressing the point that this isn't quite representative of my usual informal benchmark:

  • I'd also ask the model to first output a list of essay topics that it thinks I would write, of which I'd choose a specific one that sounded interesting, perhaps asking it to propose an outline first.
  • I would definitely run multiple iterations of the prompt or suggest specific corrections and check their adherence.
  • I would also index heavily on their ability to mimic authors I know very well. Can they pass as Gwern, or Scott, or Richard Watts? Can they take an existing essay I've written and rewrite it in an arbitrary style and produce something interesting, if not superior as a whole?

It's enough for me to spot a better way to say a specific thing I'm already saying. A single vivid metaphor or interesting analogy that is worth co-opting can make the practical purpose of the exercise worth it.

Yeah, but they're usually suffering from psychiatric illness, and the usual treatment is to tell them to go to the doctor less. Indulging them and constantly ordering investigations and treatment is pretty much malpractice.

Either way, there aren't enough of them to keep doctors employed full-time.

We'll take it into consideration, thanks.

Demand for healthcare is comparatively inelastic, but it is not unbounded. If going to the doctor was cheap, you wouldn't spend all your time going to the doctor.

The specific outcome depends heavily on a variety of factors, including the degree of boosted productivity and whether having a fully trained medical professional in the room is necessary at all. If AI could do 90% of a doctor's work and save 90% of their time, but the demand for medical care only doubled, then I can see it easily being the case that hospitals would slash headcounts and pocket the change.
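To put rough numbers on that scenario, here is a back-of-the-envelope sketch. It assumes headcount scales linearly with total doctor-hours and ignores everything else (admin overhead, on-call cover, regulation); the function name and parameters are made up purely for illustration.

```python
def doctors_needed(current_headcount, time_saved_pct, demand_multiplier):
    """Rough headcount needed after AI assistance.

    Assumes (hypothetically) that staffing scales linearly with total
    doctor-hours: each case takes (100 - time_saved_pct)% of the time it
    used to, and the number of cases grows by demand_multiplier.
    """
    remaining_pct = 100 - time_saved_pct
    return current_headcount * demand_multiplier * remaining_pct / 100


# The scenario above: AI saves 90% of a doctor's time, demand only doubles.
print(doctors_needed(100, 90, 2))   # 20.0, i.e. an 80% cut in headcount

# Demand would have to grow tenfold just to keep everyone employed.
print(doctors_needed(100, 90, 10))  # 100.0
```

Under those assumptions, a hospital with 100 doctors would need only about 20, which is the "slash headcounts and pocket the change" outcome.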

If the AI was >=100% as good as a human doctor (or got away with using less skilled alternatives like nurses, NPs etc for the physical stuff), then that might lead to mass unemployment or paycuts. 90% of doctors ending up unemployed, from my perspective, is almost as bad as all of us getting the sack.

That's already in my post. I would have liked people to give an estimate of how long they're willing to wait for the AI to try solving the problem, but nobody has bothered, so it's clear to me that they care more about the fact that it can be done at all than how long it takes. On our end, we're not going to keep trying indefinitely, we've got bigger fish to fry.

I presume, when we share logs, it'll include time stamps and reasoning times as well as tokens used. Shouldn't be too hard, I recall that all of that is there by default in Claude Code.

70% of medicine is minimizing unknown unknowns by knowing as much as you can, and knowing the boundaries of what is unknown to you. I believe a more concise way of expressing that is "knowledge". Regretfully, the books are fat and intimidating for good reason, there are a lot of things to know.

The other 30% is reasoning from knowledge, clinical experience (yet another form of knowledge, just the stuff the textbooks don't tell you) and pattern recognition.* This is more dependent on your wits, or your fluid intelligence, if I'm being precise.

The best doctors both know a lot, and are bright enough to apply that information well. The former is indispensable, you simply cannot figure out medicine by sitting in a cave and thinking very hard. I don't know if some superintelligence can look at a single human without the aid of tools, ponder very hard, and figure out everything worth knowing. All I can say is that it's beyond any actual human.

(IQ/g also correlates strongly with memory, so the relative importance of both is very hard to tease out. Especially when there's a high-pass filter with most of the idiots and amnesiacs strained out by the end of med school)

How much of the raw cognitive labor doctors do could be done by a bright undergrad with access to uptodate and a bunch of case histories, both with semantic search?

Let me put it this way: I was a bright kid, and felt like I knew a lot of medicine before entering med school, both due to cultural osmosis and because I took an interest in it. You would not have wanted me as your actual doctor. I did not know nearly as much as I thought I did.

Later, I was a med student, a year or two in and confident that I knew the gist of it. I felt ready to make my own medical decisions, at least about myself. I thought I was smart and that I did my due diligence (reading things online, including research papers). It was insufficient, I did potentially permanent damage to my own health (I'm not going to go into details). I would not want that me as my doctor either.

Now, I am a lot older and a little more knowledgeable, if not necessarily wiser. You could do worse as your doctor, at least if we're sticking to psychiatry. You could probably do better too, but I have a place on the free market. I'm cheap, I give away my advice for free on the internet to anyone who asks nicely, and many who don't.

Along the way, I almost killed people through ignorance. Thankfully, nobody died, my colleagues caught it, or the pharmacist did, or I had a sudden sinking feeling in my gut and ran back to double check. Medicine recognizes that any human is fallible, and there are plenty of safeguards in place. Every junior doctor has their story of close calls, and hopefully nothing more than close calls. All senior doctors start as junior doctors, I hope.

Consider something else: most doctors will seek out a different doctor when they suffer a condition that isn't covered by their own specialty. Sometimes even then.

If a cardiologist feels funny in the head, he'll seek a neurologist. If a neurologist feels heart palpitations, he'll go talk to a cardiologist.

Why is that? Could they both not just open the relevant textbooks and figure out what the issue is? Can a cardiologist not take his med school knowledge of neurology and then skim something Elsevier put out?

These are people with complete medical training, genuine intelligence, and full access to literature, and they still defer to each other. That's not false modesty or liability management, it's that they've learned, through experience, exactly where their pattern recognition breaks down. They know the limits of their own competence.

Maybe. It might work out fine 90% of the time. But most doctors can handle ~90% of conditions, because most conditions are common and usually simple to manage. I apologize for the tautology, I can't see my way around it.

The other 10% are where the specialists come in. You cannot take a psychiatrist (even a smart one) and give him access to UpToDate and expect him to be as good a cardiologist as an actual trained cardiologist. He might do okay, but he's going to kill people along the way.

And that is a fully qualified doctor dabbling in another branch of medicine. A "bright undergrad with access to uptodate and a bunch of case histories, both with semantic search" will crash and burn. I'd bet good money on it, it'll happen sooner rather than later.

If they set up shop and started seeing patients, bumbling their way through things and furiously looking things up as soon as they could, they might successfully treat the colds, stomach upsets, sore throats and so on. That's the bulk of undifferentiated medicine, as you'd expect. They might catch some of the rarer stuff. They will also be very poorly calibrated and commit significant iatrogenic harm. But rest assured they will kill people eventually (at a rate massively higher than a doctor normally does).

That's not even getting into time pressure, or physical findings and techniques that are impossible to adequately convey over just video and text.

LLMs? They narrow the gap significantly, but do not have thumbs. The bright undergrad would benefit immensely from ChatGPT, but rest assured that most of the performance would come from ChatGPT itself, and they would add little. Handcuffing a child to a man does not make their combination superior.

The combination of factors that make a good human clinician is rare. And when you do find them, you're investing a great deal in training to get them up to scratch. Most of this is the bottleneck of information transfer/learning, which LLMs neatly sidestep. GPT-4 did well, and it was dumb as bricks compared to current models. Turns out an encyclopedic knowledge of medicine will get you very far, even if you're not very bright. But it was also able to access and process this information faster than your thought experiment of a human with a computer.

But if you want a final answer: 60-70%. Best estimate I have.

*Sufficiently advanced pattern recognition is indistinguishable from intelligence. It might well be intelligence. You know LLMs, you know this.

https://www.calebleak.com/posts/dog-game/

Show's over. Someone's found a way to make even the most unsophisticated user into a competent game developer through judicious use of AI. I'll pack my bags.

(No, it's not actually over, I just thought this was too funny to ignore)

GPT 5.2 Thinking in Extended Reasoning mode:

https://chatgpt.com/share/699dfcfc-b0c4-800b-8e1a-870264179c40

5.2T + Agent mode, where it actually used a dedicated browser with a visual output:

https://chatgpt.com/share/699dfd6d-a7f8-800b-be8e-c04d95de44e5

I haven't checked if the answer is right, I'm recovering from a bad migraine so apologies for the laziness.

Do you have any thoughts that you'd be willing to share on what I wrote concerning the amount of knowledge work currently required to be input to do things like the task I was thinking about?

I am really the wrong person to ask this. I don't regularly use LLMs for programming purposes, when I do, it's usually for didactical purposes, or small bespoke utilities.

The most ambitious project I tried was a mod for Rimworld, which didn't work. To be fair to the models, I was asking for something very niche, and I was using the chat interface instead of an IDE. I ended up borrowing open-source code and editing it, and just using AI image generation for art assets (which worked very well, to the point it pissed off the more puritan modders in the Discord). I can mention that the issues I ran into were the models being unfamiliar with the code for the mod I intended to support (Combat Extended, a massive overhaul of core systems), and that what knowledge they had innately was outdated. I was too unfamiliar with Rimworld modding to be confident that editing their efforts was worth my time. Other people have succeeded in writing bigger mods that work well (as far as I can tell) using AI, so there's definitely an element of skill-issue on my part.

SF might have actually useful observations, but he's a lurker to the core, and I'm the forward-facing entity for the moment. He says he's generally busy with work right now, so I wouldn't wait on him to respond, though I'd be happy if he did.

If you insist:

  • I think there are very significant gains from providing models clear direction from the start, including sharing your own intuition/professional taste. That includes instructions on how to manage state or update design documents and maintain records. Experienced managers or principal architects find that much of their skills directly transfer to directing and managing agents.
  • I have little idea how well the models would do by default. Depends on the task, depends on the model. I haven't used any version of Opus, ever. The last time I used them seriously for writing code was in the GPT-4 days, and they were already better than me (I was doing programming homework and working through MIT's OCW, relying on them for educational purposes when I got stuck - I was disillusioned with medicine and exploring alternatives)

Perhaps, given your comment below, this is just something that you mostly don't care about. Does this sort of thing just bucket into, "No, it can't do this sort of knowledge work now, but with sufficient recursive self-improvement, it will be able to do it later"? (I guess, in line with your stated AGI timelines?)

I don't know if it can do this kind of knowledge work, but I do expect that it will be able to in short order. I make no firm commitments on whether this will be the direct consequence of RSI (since labs are opaque about methodology), or if it'll be a simple consequence of further scaling and increasingly intensive RLVR.

(¿Por que no los dos?)

Either way, I think it's more likely than not the kind of problem you describe will be trivial within a year or two. My impression is that the models can just about do what you want them to do, but with significant frustration and wasted time on your part. That is already a very strong starting point. Can you imagine asking GPT-4 to even attempt any of this and getting working results?

Does it have to be a coding problem? I understand that there are time and financial constraints that prevent you from trying a lot of what is being requested, but I also understand @iprayiam3's criticism that it looks like you're cherry picking for something you think the LLM can do. The problem is that for most people who aren't computer programmers they aren't going to be able to think of anything other than a piece of software that they wish existed but doesn't and ask you to write it from scratch, which is going to be cost prohibitive beyond the kind of textbook examples that were constructed for teaching purposes and don't address problems anyone is actually trying to solve. This seems like it should be marketing 101, but if you're trying to convince people that your product is worthwhile, and that's your stated goal, you have to show them that it will actually help them do something they want to do. If you tell me it can write code to fetch data from a REST API using asynchronous requests then I'll smile and nod but that's complete gibberish to me, and I won't know whether I should be impressed by it or not, or how that's supposed to improve my life.

A coding problem? Not strictly, no. I focused on coding because my collaborator SF (who is doing most of the work) is a programmer.

As you can see from discussion with Phailyoor and faul_sname, I'm open to other well-defined tasks.

So instead, I propose that we re-run the test I gave you last summer, because that is something I actually would use it for, and it obviously isn't too complicated.

I started as soon as I read this. I'm running it on 5.2 Thinking and another instance using Agent mode (the model has access to a computer of its own with a browser). It's taking a while, so I'll ping you when I'm done. I tried to be faithful to your original framing, so I didn't mention that o3 tried and failed at the task, or your critiques shared later.

If this doesn't work, then sure, I can ask SF to consider using his Claude setup to try. Shouldn't be too onerous.

Another idea I had on similar lines would be for me to arbitrarily select a parcel of land in Westmoreland County, PA (selected because all of the recorded documents are available for free online) and see if it could download every deed in the chain of title going back 100 years. This particular task isn't hard to do but would be a proof of concept that it could possibly do more sophisticated work. I recognize that there are a number of scenarios that could arise that would completely flummox the LLM here. Given that, I could run a few parcels in advance and preselect an easy one as proof of concept, though since LLM boosters like to brag about how powerful their models are I'm inclined to arbitrarily pick one without looking first and see how it does, especially since it cuts way down on the work I would need to do to verify the answer.

I have no idea, in advance, if this will work. I doubt SF does either. But it's also something we can try.

Photoshop/GIMP tool

I share your concerns with the issues arising from Photoshop being closed-source. But I'll share it too, assuming SF hasn't seen this yet. It sounds like something worth trying from my perspective, but I will stress that I am not a professional programmer so I'll be deferring to his judgment.