self_made_human
Kai su, teknon?
I'm a transhumanist doctor. In a better world, I wouldn't need to add that as a qualifier to plain old "doctor". It would be taken as granted for someone in the profession of saving lives.
At any rate, I intend to live forever or die trying. See you at Heat Death!
Friends:
I tried stuffing my friends into this textbox and it really didn't work out.
User ID: 454
https://x.com/mbalunovic/status/1907436704790651166
Gemini 2.5 Pro was released on the same day as the benchmarks, so data contamination seems rather unlikely. You'd expect contamination on all the questions, and not just two.
Thanks for the ping. As I've always said, getting models to do any better than chance is the biggest hurdle, once they're measurably better than that, further climbs up the charts are nigh-inevitable.
Depending on how old she is, she might be comfortable with a withdrawal rate greater than the 4% that's usually thought of as the "safe withdrawal rate". As others have said, a managed fund is almost always a bad call.
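To illustrate why the withdrawal rate matters so much, here's a toy sketch with my own made-up numbers, assuming a constant 3% real return (which no real portfolio delivers smoothly):

```python
# Toy model: how long a nest egg lasts at a fixed real return,
# with inflation-adjusted withdrawals. Numbers are illustrative only.
def years_lasted(withdrawal_rate, real_return=0.03, max_years=100):
    balance = 1.0            # normalize the starting portfolio to 1
    spend = withdrawal_rate  # fixed real spending, as a fraction of the start
    for year in range(1, max_years + 1):
        balance = balance * (1 + real_return) - spend
        if balance <= 0:
            return year
    return max_years  # effectively indefinite at this horizon

print(years_lasted(0.04))  # the classic 4% stretches for decades
print(years_lasted(0.08))  # doubling the rate runs out far sooner
```

Under these assumptions, 4% lasts well past a typical retirement horizon while 8% exhausts the portfolio in under two decades, which is why her age (and hence horizon) changes what counts as "safe".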
Yup. I forgot the AMA is analogous to the BMA, and that you guys don't have a central regulator like we do in the form of the GMC.
Remember OSA causes psychiatric disturbance. The DSM emphasizes that psychiatric problems are not otherwise caused by a substance or medical problem. Consider this with respect to correlation and causation. A strict doctor will not diagnose depression in someone with untreated OSA, for the same reason you won't diagnose mania in someone who is high on meth.
While this is good care, I rarely see this implemented in practice. GPs bear the brunt of insomnia, though of course it's a perennial concern in psych. I've never seen my bosses actually order formal assessments for sleep apnea, not that we have the equipment to do PSG in a psychiatric hospital.
There are more CPAP and other OSA treatment options than you can imagine. Obviously the UK probably has some limitations on this front, but you'd be surprised where you can go with this. Don't give up (for yourself OR the patients)!
I sleep just fine these days, thankfully. When it comes to patients, I'll keep an eye on whether GPs have attempted to exclude behavioral or medical causes like OSA, but if they do, it never seems to come up in the notes or referral letters. No mention of CBT-I that I can recall, even if NICE recommends it as the first line intervention, and there's even a dedicated app.
You are correct! I always overstate benzo risk in patient-facing communication and writing, because what we are certain of is quite bad enough, and it's probably better to emphasize what we may later find to be true. AND...in the best case the benzo is impairing establishment of more definitive care anyway. I find most patients and clinical staff underestimate just how addictive benzos are.
Even if the dementia risk wasn't an issue, just the addiction potential and cognitive effects should put most reasonable people off them. The former is nasty, but I guess they beat barbiturates.
Thanks for the pointers!
feel the need to caution you against believing everything you read in the marketing materials
Anthropic is a reputable company on the cutting edge of AI, so I'd ask you for concrete disagreements rather than generalized advice to be cautious.
Here are other relevant studies on the topic:
https://openreview.net/forum?id=KRnsX5Em3W
https://openreview.net/forum?id=fMFwDJgoOB
https://aclanthology.org/2023.findings-emnlp.68/, an older paper from 2023.
This applies the same standard about the ability to differentiate truth from fiction that is used to justify that belief in humans.
Further, as models get larger, hallucination rates have consistently dropped. I recently discussed a study on LLM use for medical histories which found 0% and ~0.1% hallucination rates. As I've said before, humans are not immune to hallucinations or confabulations, I'd know since I'm a psych trainee. That's true even for normal people. The only barrier is getting hallucination rates to a point where they're generally trustworthy for important decisions, and in some fields, they're there. Where they're not, even humans usually have oversight or scrutiny.
A model that doesn't hallucinate doesn't turn your vacation pictures into a Hayao Miyazaki frame either, and the latter is where the money and publicity are.
There is a difference between hallucination and imagination. That is just as true for LLMs as it is for humans. Decreasing hallucination rates do not cause a corresponding decrease in creativity, quite the opposite.
I found the old /r/Medicine thread when this study first came out, and I saw that even then, the majority of opinions were skeptical. I'd upvoted all of them.
It's a shame that there isn't much legal penalty for such knowing academic fraud, but if any of the authors are doctors, the AMA should throw their license in a bin and light it on fire. Knowing what I do about the AMA, that is unfortunately rather unlikely.
Excellent work! God knows that sleep disturbances keep many a shrink employed, so I have the bare minimum clinical experience to comment.
A lot of insomnia isn't primary, as you said. I'd know, I went half a decade being told I snore by my parents, and then my ex, before I bit the bullet and had a polysomnography conducted. I ended up with a diagnosis of borderline severe sleep apnea, I was going minutes between breaths. I was quite happy with the quantity and quality of my sleep, but I did notice tiredness and daytime somnolence for a while. Unfortunately, as someone with moderate-severe depression, it's unclear which way the causality ran. I was slightly overweight but not obese. I also had a mildly deviated nasal septum.
Even more unfortunately, I found out that I just couldn't tolerate a CPAP. My willingness to look ridiculous, while there, doesn't extend that far, especially when the machine was noisy and the strap quite uncomfortable. I gave it a few nights, didn't notice much change, and surrendered. I'd be lying if I said I tried any lifestyle modifications either, so I have immense sympathy for patients who don't adhere to our exhortations.
More recently, I noticed a combination of insomnia, early awakening and tiredness in conjunction with my depression worsening for (reasons). I ended up with a prescription for mirtazapine, which definitely helped with sleep (it's unclear if the improvement in my mood was because it worked, or because I finally found a stimulant for ADHD that didn't make me feel awful or keep me up all night).
I think I'd consider myself a nightmare patient, one who was too impatient to bother with sleep hygiene. The meds worked, though insomnia wasn't the sole reason for starting them.
Meds don't work part 3: Okay meds do work. The situations in which they do work are complicated and beyond the scope of this lecture. Snowing someone with Seroquel to help them sleep because they are manic is not unreasonable.
One day US doctors will learn to use generic names. Get with the times, old man. Quetiapine isn't a bad sedative, but just about nobody would prescribe it outside of psychotic or manic conditions because of the laundry list of side effects. If there's anybody prescribing it just for sleep, I haven't heard about it.
Same thing with aggressive grandma who is awake at night and hitting the nurses (or wait - no, you aren't supposed to do that, except sometimes you are...complicated).
While my bosses try to be Good Clinicians, and refrain from attempting to treat such behavioral disturbances with medication where feasible, I'm afraid that in the majority of cases, the demented granny gets a nice cup of covertly administered lorazepam. It makes the nurses' lives easier, and by extension, ours.
Okay, the effect sizes of meds are pretty small. Some seem to work better but are more expensive. Some are extremely dangerous for one reason or another. BZDs cause dementia, IQ loss and all-cause death. Ambien causes sleep-driving. Whether you are actually sleeping on these is a complicated question.
Funny you'd bring that up. I was just talking about it with my boss, and in my attempts to show clinical curiosity and ingratiate myself, I brought up the topic of benzos and dementia. I'd already read up about it, the causation is fraught. It's very hard to rule out protopathy, where the anxiety and insomnia of early depression forces doctors to resort to benzos, with the actual diagnosis of dementia coming later. In his opinion, the evidence is too mixed to speak about conclusively. Especially in the elderly, where sedation risks outweigh other concerns (half the patients already have dementia, what does it really matter if their risk increases?). They will, however, make you dumber and increase mortality.
Because of all the above it is EXTREMELY easy to get VERY poor quality care for insomnia. Insomnia is miserable, patients have preconceptions about what works, and they are very demanding. Even if YOU aren't very demanding, many doctors will be like "fine, whatever" because they don't want to have "one of those" conversations. Be careful.
Because of my own personal issues with leading a healthy lifestyle, I tend to be unusually sympathetic and non-pushy about lifestyle modifications when patients are recalcitrant. I still think thrice before prescribing benzos beyond a short course, though something like melatonin is so harmless that it's almost always worth a shot. (I lean towards Scott's opinion that most clinicians are retarded and prescribe far too high a dose. An adult is best off starting at 300mcg)
I've mentioned the new drugs targeting orexin receptors. They're big news, a novel hypnotic with good effect sizes and no demonstrated dependency risks??? Maybe we can move away from benzos. I haven't prescribed them, haven't seen them prescribed, but I have read the literature and will just about say they're worth a shot and better for you than benzos.
My working algorithm for someone with sleep issues:
- Politely ask them to try lifestyle modifications.
- They fail. My condolences. Offer them CBT-I. This has never worked for anyone I've seen in clinic or in hospital.
- Undergo polysomnography, or at least ask a partner if they snore or stop breathing at night. If this reveals sleep apnea, consider CPAP, splints or mouth tape; if not, reinforce the previous advice.
- Trial melatonin.
- Doesn't work? I won't prescribe benzos long-term, but I will tentatively endorse drugs like daridorexant.
You might want to look into newer drugs that target orexin receptors, like daridorexant. I've discussed them above.
Benzos and z-drugs are right out. Melatonin is only modestly effective. Antihistamines? Eh, they're not my first choice for prolonged use, they have deleterious effects on cognition.
Very recently, a new class of sleeping pills that target orexin receptors has come out on the market. An example would be daridorexant. They're reasonably effective, and surveillance for 12 months of sustained use hasn't demonstrated any addictive properties (and we're much better at evaluating that, not like when benzos were marketed as non-habit forming).
They seem to me to be significantly more effective than melatonin, and roughly on par with benzos (which do work! They're just risky and addictive). Ask your doctor if they're willing to prescribe for you, but I haven't seen them used myself (they're very new and the consultants haven't caught on).
Others already pointed out how none of the insights you credit Rationalists with are unique to them, nor were they the first ones, so I'll skip over that.
They were directly responsible for the promulgation of those concepts and popularizing them, first in the tech sphere, and then just about globally.
The man who caused a flash of light when he accidentally shorted a primitive battery isn't credited with the invention of the lightbulb, the person who made them commercially viable is.
This is only true to the extent that their primary goal is not letting anyone else have the AI-god. Their preferred outcome is still for AI to exist, they just want it to be 100% under control of people with Rationalist values. So while there exists a set of circumstances where I might end up allying with them, their actual goals are one of my nightmare scenarios, and I'm much more aligned with the average population on this issue
Religious people seem to believe that a God exists (and the major strains think that this entity is somehow omnipotent, omniscient and omnibenevolent). Those who don't, think that something even approaching those values is a Good Thing.
The majority of Rats don't think an aligned ASI is strictly necessary for eudaimonia, but it sure as hell helps.
Besides, the only actual universal trait required to be a rationalist is to highly value the art of rationality and to seek to apply it. You don't have to be a Rat to be rational, anyone who has made a budget is trying to be rational.
But they're not (necessarily) being Rationalist, or following Enlightenment principles.
Which is fine. I'm not contesting that. As I said, you don't have to be a card-carrying rationalist to be rational. They just think it's a topic worth formal analysis.
I'm pretty sure that the first time I heard it, I was but a wee little lad playing with my toys in the living room, overhearing what my parents were watching on the TV, and some talking heads dropping the phrase in the context of divorce. I doubt they got it from Rationalists.
"Irreconcilable differences" is a phrase that's been around for a while, with the most obvious application being in a legal context. The values bit is a rationalist shibboleth.
Gemini has an enormous context length. I think 2.0 Pro could accept 2 million tokens, and 2.5 Pro is a regression back to a million. That's still way ahead of the competition.
There's very little I personally need to do that requires CLs that long. I think the most I've ever used in a natural manner was when I was translating a novel and used up 250k tokens. I've used more for the hell of it, but never actually maxed it out before I got bored.
Models are also getting better at making good use of the additional context, but there's still performance degradation that's not captured by benchmarks like needle in a haystack tests.
Give Gemini a go, preferably through AI Studio. I think it'll do much better than Claude. I know some documentation in programming can be enormous, but probably not 400k words enormous? You could alternatively enable search if the docs are publicly available on the web.
This happens because while the model has been programmed by some clever sod to apologize when told that it is wrong, it doesn't actually have a concept of "right" or "wrong". Just tokens with different correlation scores.
This is wrong! It would have been a reasonable claim to make a few years back, but we know for a fact this isn't true now:
https://www.anthropic.com/research/tracing-thoughts-language-model
It turns out that, in Claude, refusal to answer is the default behavior: we find a circuit that is "on" by default and that causes the model to state that it has insufficient information to answer any given question. However, when the model is asked about something it knows well—say, the basketball player Michael Jordan—a competing feature representing "known entities" activates and inhibits this default circuit (see also this recent paper for related findings). This allows Claude to answer the question when it knows the answer. In contrast, when asked about an unknown entity ("Michael Batkin"), it declines to answer.
There was other relevant work which shows that models, if asked if they're hallucinating, can usually find such errors. They very much have an idea of true versus false, to deny that would be to deny the same for humans, since we ourselves confabulate or can be plain old wrong.
Unless you explicitly tell/program it to exclude the specific mistake/mistakes that it made from future iterations (a feature typically unavailable in current LLMs without a premium account) it will not only continue to make but "double down" on those mistakes, because whatever most correlates with the training data must, by definition, be correct.
Gemini 2.5 Pro Thinking in particular is far more amenable to reason. It doesn't normally double down and will accept correction. At least ChatGPT has the option to add memories about the user, so you can save preferences or tell it to act differently.
I'm slightly disappointed to catch it hallucinating, which is why I went to this much trouble instead of just accepting that as a fact the moment someone contested it. It's still well ahead of the rest.
Ah... I get it now. Thank you! I'm disappointed to see hallucination and confabulation here, but it you're inclined, do keep trying out Gemini 2.5 Pro Thinking in particular. It's a good model.
Hmm. I think that's likely because my prompt heavily encouraged it to reason and calculate from first principles. It's a good thing that it noted that those attempts didn't align with pre-existing knowledge, and accurately recalled the relevant values, which must be a nigh-negligible amount of the training data.
At the end of the day, what matters is whether the model outputs the correct answer. It doesn't particularly matter to the end user if it came up with everything de-novo, remembered the correct answer, or looked it up. I'm not saying this can't matter at all, but if you asked me or 99.999% of the population to start off trying to answer this problem from memory, we'd be rather screwed.
Thanks for the suggestion and looking through the answer, I've personally run up to the limits of my own competence, and there are few things I can ask an LLM to do that I can't, while still verifying the answer myself.
- LessWrong led the charge on even considering the possibility of AI going badly, and on treating that as a concern to be taken seriously. It was the raison d'être for both OpenAI (initially founded as a non-profit to safely develop AGI) and especially Anthropic (founded by former OpenAI leaders explicitly concerned about the safety trajectory of large AI models). The idea that AGI is plausible, potentially near, and extremely dangerous was a core tenet in those circles.
- Anthropic in particular is basically Rats/EAs, the company. Dario himself, Chris Olah, a whole bunch of others.
- OAI's initial foundation as a non-profit used funds from Open Philanthropy, an EA/Rat charitable foundation. They received about $30 million, which meant something in the field of AI back in the ancient days of 2017. SBF, notorious as he is, was at the very least a self-proclaimed EA and invested a large sum in Anthropic. Dustin Moskovitz, the primary funder of Open Phil, led initial investment into Anthropic. Anthropic President Daniela Amodei is married to former Open Philanthropy CEO Holden Karnofsky; Anthropic CEO Dario Amodei is her brother and was previously an advisor to Open Phil.
As for Open Phil itself, the best way to summarize is: Rationalist Community -> Influenced -> Effective Altruism Movement -> Directly Inspired/Created -> GiveWell & Good Ventures Partnership -> Became -> Open Philanthropy.
Note that I'm not claiming that Rationalists deserve all the credit for modern AI. Yet a claim that the link between them is as tenuous as that between ice cream and drowning is farcical. Any study of the aetiogenesis of the field that ignores Rat influence is fatally flawed.
I copied your comment, and it insisted it was correct. I then shared the image, and it seems to think that the issue is imprecise terminology on its part rather than an actual error.
Here's the initial response:
After putting the image in:
The important bit:
The proof never claims $r_i = n^{k-i-1}$. It uses $r_i = n^{k-i-1} \pmod{2^{i+1}}$ and the derived property $r_i \ge 1$.
Conclusion: The confusion likely arises from either the slightly ambiguous notation in the highlighted sentence (which should explicitly state "fractional part of ... is ...") or a misreading of the later step where the lower bound $r_i \ge 1$ is applied. The mathematical logic itself appears sound.
Thank you.
What do you mean by "reference NIST"? I think I've already mentioned that despite its internal chain of thought claiming to reference NIST or "look up" sources, it's not actually doing that. It had no access to the internet. I bet that's an artifact of the way it was trained, and regardless, the COT, while useful, isn't a perfect rendition of inner cognition. When challenged, it apologizes for misleading the user, and says that it was a loose way of saying that it was wracking its brains and trying to find the answer in the enormous amount of latent knowledge it possesses.
I also find it very interesting that the model that couldn't use code to run its calculations got a very similar answer. It did an enormous amount of algebra and arithmetic, and there was every opportunity for hallucinations or errors to kick in.
I don't follow the AI developments terribly closely, and I'm probably missing a few IQ points to be able to read all the latest papers on the subjects like Dase does, so I could be misremembering / misunderstanding something, but from what I heard capital 'R' Rationalism has had very little to do with it, beyond maybe inspiring some of the actual researchers and business leaders.
Yudkowsky himself? He's best described as an educator and popularizer. He hasn't done much in terms of practical applications, beyond founding MIRI, which is a bit player. But right now, leaders of AI labs use rationalist shibboleths, and some high-ranking researchers like Neel Nanda, Paul Christiano and Jan Leike (and Ryan Moulton too, he's got an account here to boot) are all active users on LessWrong.
The gist of it is that the founders and early joiners of the big AI labs were strongly motivated by their beliefs in the feasibility of creating superhuman AGI, and also their concern that there would be a far worse outcome if someone else, who wasn't as keyed into concerns about misalignment was the first to go through.
As for building god, I think I heard that story before, and I believe its proper ending involves striking the GPU cluster with a warhammer, followed by several strikes with a shortsword. Memes aside, it's a horrible idea, and if it's successful it will inevitably be used to enslave us.
You'll find that members of the Rationalist community are more likely to share said beliefs than the average population.
Yud had a whole institute devoted to studying AI, and he came up with nothing practical. From what I heard, the way the current batch of AIs work has nothing to do with what he was predicting, he just went "ah yes, this is exactly what I've been talking about all these years" after the fact.
Yudkowsky is still more correct than 99.9999% of the global population. He did better than most computer scientists and the few ML researchers around then. He correctly pointed out that you couldn't just expect that a machine intelligence would come out following human values (he also said that it would understand them very well, it just wouldn't care, it's not a malicious or naive genie). Was he right about the specifics, such as neural networks and the Transformer architecture that blew this wide open? He didn't even consider them, but almost nobody really did, until they began to unexpectedly show promise.
I repeat, just predicting that AI would reach near-human intelligence (not that they're not already superintelligent in narrow domains) before modern ML is a big deal. He's on track when it comes to being right that they won't stop there, human parity is not some impossible barrier to breach. Even things like recursive self-improvement are borne out by things like synthetic data and teacher-student distillation actually working well.
In any case, when I bring up rationalism's failure, I usually mean its broader promises of transcending tribalism, systematized winning, raising the sanity waterline, and making sense of the world. In all of these, it has failed utterly.
Anyone who does really well in a consistent manner is being rational in a way that matters. There are plenty of superforecasters and Quant nerds who make bank on being smarter and more rational given available information than the rest of us. They just don't write as many blog posts. They're still applying the same principles.
Making sense of the world? The world makes pretty good sense all considered.
It makes sense, because my feelings toward rationalism and transhumanism are quite similar. Irreconcilable value differences are irreconcilable, though funnily enough most transhumanists, yourself included, seem like decent blokes.
Goes both ways. I'm sure you're someone I can talk to over a beer, even if we vehemently disagree on values.
(The precise phrase "irreconcilable values difference" is a Rationalist one, it's in the very air we breathe, we've adopted their lingo)
I asked Gemini 2.5 Pro Thinking to solve it. It claimed to have a solution. I asked for the most concise summary it could provide:
Okay, here's a concise summary of the proof, avoiding technical jargon:
The Problem: Imagine you have a way of writing numbers that isn't based on powers of 10 (like our usual 1s, 10s, 100s), but on powers of $2n$, where $n$ is some odd number (like powers of 6, if $n=3$, or powers of 10, if $n=5$). We look at the number $n^k$ ($n$ multiplied by itself $k$ times). We want to show that if you pick $n$ to be a really large odd number, then when you write $n^k$ in this "base $2n$" system, all the "digits" you use will be bigger than some target value $d$.
How the Proof Works:
- Figuring out the Digits: The proof first finds a way to calculate the digits used to write $n^k$ in the base $2n$ system. Let's call the digits $c_0, c_1, c_2, \dots$.
- The Last Digit ($c_0$): It turns out the very last digit ($c_0$, the "ones" place in this system) is simply $n$ itself. So, to make this digit bigger than $d$, we just need $n$ to be bigger than $d$.
- The Other Digits: The proof finds a formula for the other digits ($c_1, c_2,$ etc.). It shows that each digit $c_i$ is calculated by taking $n$, multiplying it by some specific odd number (which depends on $k$ and which digit it is), dividing by a power of 2 (like 2, 4, 8, ...), and then rounding down.
- Making Digits Large: Since the odd number multiplier is at least 1, each digit $c_i$ is at least roughly $n$ divided by some power of 2 (like $n/1, n/2, n/4, \dots, n/2^{k-1}$).
- The Threshold: If we want all digits to be bigger than $d$, we need even the smallest of these estimates (which involves dividing $n$ by the largest power of 2, $2^{k-1}$) to be bigger than $d$. The proof calculates that if $n$ is larger than a specific value $N$ (which depends on $k$ and $d$, specifically $N = 2^{k-1}(d+1)$), then every digit is guaranteed to be greater than $d$.
- Conclusion: Therefore, such a threshold $N$ exists. Pick any odd number $n$ larger than this $N$, and all the digits of $n^k$ in base $2n$ will be larger than $d$.
Was it Induction?
The main argument of the proof is not primarily by induction. It's a direct proof that calculates the digits and then finds a condition on $n$ (making it large enough) to satisfy the requirement.
However, induction was used for one small intermediate step: proving that $n^k$ always leaves a remainder of $n$ when divided by $2n$ (which showed that the last digit $c_0$ is indeed $n$). But the overall structure connecting the size of $n$ to the size of all digits didn't rely on induction.
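The bound in that summary is easy to spot-check numerically. Here's a minimal sketch of my own (not Gemini's code; `digits_base` is a helper I'm inventing for the check):

```python
# Spot-check of the claimed threshold: for odd n > 2^(k-1) * (d+1),
# every digit of n^k written in base 2n should exceed d.
def digits_base(x, b):
    """Return the base-b digits of x, least significant first."""
    ds = []
    while x > 0:
        ds.append(x % b)
        x //= b
    return ds

for k in (2, 3, 4):
    for d in (0, 1, 2):
        N = 2 ** (k - 1) * (d + 1)           # threshold from the proof
        start = N + 1 if (N + 1) % 2 else N + 2  # smallest odd n above N
        for n in range(start, start + 60, 2):
            assert all(c > d for c in digits_base(n ** k, 2 * n)), (k, d, n)
print("threshold holds on all tested cases")
```

For instance $n=3$, $k=2$ gives $9 = 1 \cdot 6 + 3$ in base 6, so the digits are $[3, 1]$ with $c_0 = n = 3$, matching the summary's claim about the last digit.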
Here's the raw answer (minus reasoning trace):
Absolutely. It's the revealed preference of many a child to labor away in the virtual caves of Minecraft. What other choice do they have, when adults are so dull and near-sighted? Hook them up with a VR setup that controls real diggers! Send them to the mines, that's what the ~~minors~~ miners yearn for.
When most people use the term "utilitarianism", they're talking about the Benthamian or Singerian notion. This is a mistake I've made myself, having argued with some poor guy on the old Motte where I claimed that since I have a utility function, I'm therefore utilitarian. I've learned from that error.
My understanding is that most humans aren't VNM rational! They violate one or more of the different requirements, in the sense that their preferences can be contradictory. An example is the Allais Paradox. I don't know if any human is actually VNM rational, but I don't think that's necessarily impossible for someone who is good at meta-cognition and math.
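The Allais paradox can be spot-checked numerically. A minimal sketch of my own (using the standard hypothetical payoffs; I normalize u($0)=0 and u($5M)=1, which loses no generality for monotone utilities):

```python
# The Allais paradox: the common preference pattern (A over B, and D over C)
# cannot come from any expected-utility maximizer, i.e. it violates VNM axioms.
# Gambles as (probability, payoff in $M):
A = [(1.00, 1)]
B = [(0.89, 1), (0.10, 5), (0.01, 0)]
C = [(0.11, 1), (0.89, 0)]
D = [(0.10, 5), (0.90, 0)]

def eu(gamble, u):
    """Expected utility of a gamble under utility assignment u."""
    return sum(p * u[x] for p, x in gamble)

# Grid-search candidate utilities with u(0)=0, u(5)=1, varying u(1).
found = False
for i in range(101):
    u = {0: 0.0, 1: i / 100, 5: 1.0}
    if eu(A, u) > eu(B, u) and eu(D, u) > eu(C, u):
        found = True
print(found)  # no utility assignment reproduces the popular choice pattern
```

A grid search isn't a proof, but the algebra behind it is: A ≻ B requires 0.11·u(1M) > 0.10·u(5M), while D ≻ C requires the reverse inequality, so no single utility function can produce both preferences.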
Note that I'm not disagreeing with Yudkowsky here, I was aiming to rebut @Primaprimaprima 's (implicit, by my understanding) claim that not being a utilitarian disqualified him from being a "Yudkowskian Rationalist".
As an aside, this is where I most differ from Yudkowsky on the current race to AGI: he seems to think we're now extra-doomed because we don't even fully understand the AIs we're creating; I think we're now fractionally-doomed for the same reason. The contrapositive of "a utility function simple enough to understand is unsafe" is "a safe utility function is something we won't fully understand". I don't know if stochastic descent + fine-tuning for consistency will actually derive a tolerably human value system starting from human text/audio/video corpuses, but it's at least possible.
I disagree with Yud on this myself. My p(doom) has gone down from a max of 70% to a far less concerning 20% these days. Our alignment techniques, while imperfect, produce LLMs which are remarkably in-sync with the goals and desires of their creators (and to a lesser extent, their users). Anthropic is doing excellent mechanistic interpretability work, such as recent studies into how Claude actually thinks (it's not just predicting the next token; it backtracks and "thinks ahead"). They're not entirely black boxes, as was feared to be the case before modern LLMs arrived.
It's also remarkable that RLHF works, and I'm confident that Yudkowsky was surprised by this, even if his priors didn't update that much (I recall a Twitter post along these lines). I was surprised, I remember thinking, holy shit, this works??
Note that just because a model is aligned with its creators/users, that doesn't mean it's aligned with me. Consider the possibility that a Chinese AGI follows orders while understanding the CCP's intent exactly, but said orders are to permanently disempower all non-Chinese and wrest control of the light cone (casualties are acceptable).
I will note that Gemini 2.0 Flash and GPT-4o are significantly behind the SOTA! The latter got a very recent update that made it the second best model on LM Arena, but they're both decidedly inferior in reasoning tasks compared to o1, o3 or Gemini 2.5 Pro Thinking. (Many caveats apply, since o1 and o3 have different sub-models and reasoning levels)
I asked two instances of Gemini 2.5 Pro:
Number 1:
What is the decay time for the 3p-1s transition in hydrogen? Make sure you are certain about your answer, after doing the relevant calculations.
Final answer: 5.27 ns
Second iteration:
What is the decay time for the 3p-1s transition in hydrogen? Make sure you are certain about your answer, after doing the relevant calculations. I have enabled code execution, if you think that would help with the maths.
Final answer: 5.28 ns
I wasn't lying to it, I'd enabled its ability to generate and execute code. Neither instance had access to Google Search, which is an option I could toggle. I made sure it was off. If you read the traces closely, you see mention of "searching the NIST values", but on being challenged, the model says that it wasn't looking it up, but trying to jog its own memory. This is almost certainly true.
I've linked to dumps of the entire reasoning trace and "final" answer:
First instance- https://rentry.org/cqty47r2
Second instance- https://rentry.org/2oyx24sa
I certainly don't know the answer myself, so I used GPT-4o with search enabled to evaluate the correctness of the answer. It claimed that both were excellent, and the correct value is around 5.4 ns according to experimental results (the decay time for the hydrogen 3p state).
I also used plain old Google, but didn't find a clear answer. There might be one in: https://link.springer.com/article/10.1007/s12043-018-1648-4?
But it's paywalled. I don't know if GPT-4o was able to access it despite this impediment.
Edit:
DeepSeek R1 without search claimed 1.2e-10 seconds. o3-mini without search claims 21 ns.
https://blog.google/products/gemini/how-we-built-gemini-robotics/
This would be a strong contender, alongside Boston Dynamics and their Chinese counterparts. The capabilities demonstrated here are staggering, they just need to get a little faster and cheaper.