SnapDragon

0 followers   follows 0 users
joined 2022 October 10 20:44:11 UTC
Verified Email
User ID: 1550

No bio...

Well, I don't think your analogy of the Turing Test to a test for general intelligence is a good one. The reason the Turing Test is so popular is that it's a nice, objective, pass-or-fail test, which makes it easy to apply - even if it's understood that it isn't perfectly correlated with AGI. (If you take HAL and force it to output a modem sound after every sentence it speaks, it fails the Turing Test every time, but that has nothing to do with its intelligence.)

Unfortunately we just don't have any simple definition or test for "general intelligence". You can't just ask questions across a variety of fields and declare "not intelligent" as soon as it fails one (or else humans would fail as soon as you asked them to rotate an 8-dimensional object in their head). I do agree that a proper test requires that we dynamically change the questions (so you can't just fit the AI to the test). But I think that, unavoidably, the test is going to boil down to a wishy-washy preponderance-of-evidence kind of thing. Hence everyone has their own vague definition of what "AGI" means to them; honestly, I'm fine with saying we're not there yet, but I'm also fine arguing that ChatGPT already satisfies it.

There are plenty of dynamic, "general", never-before-seen questions you can ask where ChatGPT does just fine! I do it all the time. The cherrypicking I'm referring to is, for example, the "how many Rs in strawberry" question, which is easy for us and hard for LLMs because of how they see tokens (and, also, I think humans are better at subitizing than LLMs). The fact that LLMs often get this wrong is a mark against them, but it's not iron-clad "proof" that they're not generally intelligent. (The channel AI Explained has a "Simple Bench" that I also don't really consider a proper test of AGI, because it's full of questions that are easy if you have embodied experience as a human. LLMs obviously do not.)
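To make the tokenization point concrete, here's a minimal sketch (assuming the tiktoken package, which exposes the same byte-pair tokenizers the OpenAI models use) of how an LLM "sees" the word: as a few opaque sub-word chunks, not as ten letters. The exact split depends on the tokenizer, but the individual Rs are never directly visible to the model.

    import tiktoken

    # Tokenizer used by GPT-4-era models; the model only ever sees the token IDs.
    enc = tiktoken.get_encoding("cl100k_base")

    word = "strawberry"
    token_ids = enc.encode(word)
    pieces = [enc.decode([t]) for t in token_ids]

    print(token_ids)        # a handful of integer IDs
    print(pieces)           # sub-word chunks, not letters (exact split varies by tokenizer)
    print(word.count("r"))  # trivial for ordinary code: 3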

In the movie Phenomenon, rapidly listing mammals from A-Z is considered a sign of extreme intelligence. I can't do it without serious thought. ChatGPT does it instantly. In Bizarro ChatGPT world, somebody could write a cherrypicked blog post about how I do not have general intelligence.

FWIW, I appreciate this reply, and I'm sorry for persistently dogpiling you. We disagree (and I wrongly thought you weren't arguing in good faith), but I definitely could have done a better job of keeping it friendly. Thank you for your perspective.

Most frustratingly, the things that I actually need help on, the ones where I don't know really anything about the topic and a workable AI assistant would actually save me a ton of time, are precisely the cases where it fails hard (as in my examples where stuff doesn't even work at all).

That does sound like a real Catch-22. My queries are typically in C++/Rust/Python, which the models know backwards, forwards, and sideways. I can believe that there's still a real limit to how much an LLM can "learn" a new language/schema/API just by dumping docs into the prompt. (And I don't know anything about OpenAI's custom models, but I suspect they're just manipulating the prompt, not using RL.) And when an LLM doesn't know how to do something, there's a risk it will fake it (hallucinate). We're agreed there.

Maybe using the best models would help. Or maybe, given the speed things are improving, just try again next year. :)

What the hell? You most definitely did NOT give any evidence then. Nor in our first argument. I'm not asking so I can nitpick. I would genuinely like to see a somewhat-compact example of a modern LLM failing at code in a way that we both, as programmers, can agree "sucks".

Isn't there a case to be made for an exception here? It's not some cheap "gotcha"; there's an actual relevant point to be made when you fail to spot the AI paragraph without knowing you're being tested on it. The fact that @self_made_human did catch it is interesting data! To me, it's similar to when Scott would post "the the" (broken by line breaks) at random to see who could spot it.

Right, and I asked you for evidence last time too. Is that an unreasonable request? This isn't some ephemeral value judgement we're debating; your factual claims are in direct contradiction to my experience.

Please post an example of what you claim is a "routine" failure by a modern model (2.5 Pro, o3, Claude 3.7 Sonnet). This should be easy! I want to understand how you could possibly know how to program and still believe what you're writing (unless you're just a troll, sigh).

There are plenty of tasks (e.g. speaking multiple languages) where ChatGPT exceeds the top human, too. Given how much cherrypicking the "AI is overhyped" people do, it really seems like we've actually redefined AGI to "can exceed the top human at EVERY task", which is kind of ridiculous. There's a reasonable argument that even lowly ChatGPT 3.0 was our first encounter with "general" AI, after all. You can have "general" intelligence and still, you know, fail at things. See: humans.

I'm also not allowed to use the best models for my job, so take my advice (and, well, anyone else's) with a grain of salt. Any advice you get might be outdated in 6 months anyway; the field is evolving rapidly.

I think getting AI help with a large code base is still an open problem. Context windows keep growing, but (IMO) the model isn't going to get a deep understanding of a large project just from pasting it into the prompt. Keep to smaller components; give it the relevant source files, and also lots of English context (like the headers/docs you mentioned). You can ask it design questions (like "what data structure should I use here?"), or for code reviews, or have it implement new features. (I'm not sure about large refactors - that seems risky to me, because the model's temperature could make it randomly change code that it shouldn't. Stick to output at a scale that you can personally review.)
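As a rough sketch of what I mean - the file names, the question, and the model name here are invented for illustration, and it assumes the standard openai Python client - the request is basically just a few relevant files plus plain English:

    from pathlib import Path
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    # Hypothetical files: just the component you're working on, not the whole repo.
    context_files = ["scheduler.h", "scheduler.cc", "docs/scheduler.md"]
    sources = "\n\n".join(f"--- {p} ---\n{Path(p).read_text()}" for p in context_files)

    prompt = (
        "Here is the scheduling component of our project, plus its docs.\n\n"
        f"{sources}\n\n"
        "We need to add per-task priorities. What data structure would you use for "
        "the ready queue, and why? Keep any code small enough for me to review by hand."
    )

    response = client.chat.completions.create(
        model="gpt-4o",  # whichever model you're allowed to use
        messages=[{"role": "user", "content": prompt}],
    )
    print(response.choices[0].message.content)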

The most important thing to remember is that an LLM's superpower is comprehension: describe what you want in the same way you would to a fellow employee, and it will always understand. It's not some weird new IDE with cryptic key commands you have to memorize. It's a tool you can (and should) talk to normally.

Let me join the chorus of voices enthusiastically agreeing with you about how jobs are already bullshit. I've never been quite sure whether this maximally cynical view is true, but it sure feels true. One white-collar worker has 10x more power to, well, do stuff than 100 years ago, but somehow we keep finding things for them to do. And so Elon can fire 80% of Twitter staff, and "somehow" Twitter continues to function normally.

With that said, I worry that this is a metastable state. Witness how thoroughly the zeitgeist of work changed after COVID - all of a sudden, in my (bullshit white-collar) industry, it's just the norm to WFH and maybe grudgingly come in to the office for a few hours 3 days a week. Prior to 2020, it was very hard to get a company to agree to let you WFH even one day a week, because they knew you'd probably spend the time much less productively. Again, "somehow" the real work that was out there still seems to get done.

If AI makes it more and more obvious that office work is now mostly just adult daycare, that lowers the transition energy even more. And we might just be waiting for another sudden societal shock to get us over that hump, and transition to a world where 95% of people are unemployed and this is considered normal. We're heading into uncharted waters.

Well, based on what I know of the Canadian indigenous peoples (who the current PC treadmill calls the "First Nations"), there's a lot of crime, misery, and unrest as a result. But hey, people addicted to videogames are less destructive than people addicted to alcohol, so we'll see.

(Also, I really don't expect to see decent neuralink tech by 2030. It's just too damn hard.)

Definitely an important point. I agree that there is a real possibility of societal breakdown under those kinds of conditions. Hopefully, even if UBI efforts never go anywhere, people will still somehow scrounge up enough to eat and immerse themselves in videogames. (We're kind of halfway there today, to be honest, judging from most of my friends.) Somehow society and the economy survived the insane COVID shutdowns (surprising me). I have hope they'll be resilient enough to survive this too. But there's no historical precedent we can point to...

Any particular reason why you're optimistic? What are your priors in regards to AI?

Same as you, I don't pretend to be able to predict an unknowable technological future and am just relying on vibes, so I'm not sure I can satisfy you with a precise answer...? I outlined why I'm so impressed with even current-day AI here. I've been working in AI-adjacent fields for 25 years, and LLMs are truly incomparable in generality to anything that has come before. AGI feels close (depending on your definition), and ASI doesn't, because most of our gains are just coming from scaling compute up (with logarithmic benefits). So it's not an existential threat, it's just a brand new transformative tech, and historically that almost always leads to a richer society.

You don't tend to get negative effects on the tech tree in a 4X game, after all. :)

Yeah, it's a good question that I can't answer. I suspect if all humans somehow held to a (not perfect but decent) standard of not driving impaired or distracted, signaling properly, and driving the speed limit or even slower in dangerous conditions ... that would probably decrease accidents by at least 80% too. So maybe self-driving cars are still worse than that.

To quote Hawkeye/Ronin: "Don't give me hope."

Waymo has a lot of data, and claims a 60-80% reduction in accidents per mile for self-driving cars. You should take it with a grain of salt, of course, but I think there are people holding them to a decent reporting standard. The real point is that even being 5x safer might not be enough for the public. Same with having an AI parse regulations/laws...

Fantastic post, thanks! Lots of stuff in there that I can agree with, though I'm a lot more optimistic than you. Those 3 questions are well stated and help to clarify points of disagreement, but (as always) reality probably doesn't factor so cleanly.

I really think almost all the meat lies in Question 1. You're joking a little with the "line goes to infinity" argument, but I think almost everyone reasonable agrees that near-future AI will plateau somehow; there's a world of difference, though, in where it plateaus. If it goes to ASI (say, 10x smarter than a human or better), then fine, we can argue about questions 2 and 3 (though I know this is where doomers love spending their time). Admittedly, it IS kind of wild that this is a tech where we can seriously talk about singularity and extinction as potential outcomes with actual percentage probabilities. That certainly didn't happen with the cotton gin.

There's just so much space between "as important as the smartphone" -> "as important as the internet" (which I am pretty convinced is the baseline, given current AI capabilities) -> "as important as the industrial revolution" -> "transcending physical needs". I think there's a real motte/bailey in effect, where skeptics will say "current AIs suck and will never get good enough to replace even 10% of human intellectual labour" (bailey), but when challenged with data and benchmarks, will retreat to "AIs becoming gods is sci-fi nonsense" (motte). And I think you're mixing the two somewhat, talking about AIs just becoming Very Good in the same paragraph as superintelligences consuming galaxies.

I'm not even certain assigning percentages to predictions like this really makes much sense, but just based on my interactions with LLMs, my good understanding of the tech behind them, and my experience using them at work, here are my thoughts on what the world looks like in 2030:

  • 2%: LLMs really turn out to be overhyped, attempts at getting useful work out of them have sputtered out, I have egg all over my face.
  • 18%: ChatGPT o3 turns out to be roughly at the plateau of LLM intelligence. Open-Source has caught up, the models are all 1000x cheaper to use due to chip improvements, but hallucinations and lack of common sense are still a fundamental flaw in how the LLM algorithms work. LLMs are the next Google - humans can't imagine doing stuff without a better-than-Star-Trek encyclopedic assistant available to them at all times.
  • 30%: LLMs plateau at roughly human-level reasoning and superhuman knowledge. A huge amount of work at companies is being done by LLMs (or whatever their descendant is called), but humans remain employed. The work the humans do is even more bullshit than the current status quo, but society is still structured around humans "pretending to work" and is slow to change. This is the result of "Nothing Ever Happens" colliding with a transformative technology. It really sucks for people who don't get the useless college credentials to get in the door to the useless jobs, though.
  • 40%: LLMs are just better than humans. We're in the middle of a massive realignment of almost all industries; most companies have catastrophically downsized their white-collar jobs, and embodied robots/self-driving cars are doing a great deal of blue-collar work too. A historically unprecedented number of humans are unemployable, economically useless. UBI is the biggest political issue in the world. But at least entertainment will be insanely abundant, with Hollywood-level movies and AAA-level videogames being as easy to make as Royal Road novels are now.
  • 9.5%: AI recursively accelerates AI research without hitting engineering bottlenecks (a la "AI 2027"), ASI is the new reality for us. The singularity is either here or visibly coming. Might be utopian, might be dystopian, but it's inescapable.
  • 0.5%: Yudkowsky turns out to be right (mostly by accident, because LLMs resemble the AI in his writings about as closely as they resemble Asimov's robots). We're all dead.

As far as I can tell, the vast vast VAST majority of it is slop full of repurposed music and lyrics that get by on being offensive rather than clever. Rap artists aren't known for being intelligent, after all. I suspect most "celebrated" rap music would fail a double-blind test against some rando writing parodic lyrics and banging on an audio synthesizer for a few hours. Much like postmodern art, where the janitor can't tell it apart from trash.

There probably are some examples of the genre that I could learn to appreciate (Epic Rap Battles of History comes to mind), but it's hard to find them because of the pomo effect.

Sounds like you have some practical experience here. Yeah, if just iterating doesn't help and a human has to step in to "fix" the output, then at least there'll be some effort required to bring an AI novel to market. But it does feel like detectors (even the good non-Grammarly ones) are the underdogs fighting a doomed battle.

Indeed, one of the fundamental conjectures in CS, "P != NP", can be somewhat rephrased as "it's easier to check an answer than to produce it". I think it's actually something of an optimistic view of the future that most things will end up produced with generative AI, but humans will still have a useful role in checking its work.
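The standard toy example (nothing from this thread, just subset-sum) shows the asymmetry: producing an answer means searching over exponentially many candidates, while checking a proposed answer is a one-liner.

    from itertools import combinations

    # Toy subset-sum instance: which numbers add up to the target?
    nums, target = [3, 34, 4, 12, 5, 2], 9

    def find_subset(nums, target):
        # Producing an answer: brute force over all 2^n subsets.
        for r in range(1, len(nums) + 1):
            for combo in combinations(nums, r):
                if sum(combo) == target:
                    return combo
        return None

    def check_subset(combo, target):
        # Checking an answer: trivial.
        return sum(combo) == target

    answer = find_subset(nums, target)
    print(answer, check_subset(answer, target))  # (4, 5) True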

The thing is, the people producing the novel can use the detectors too, and iterate until the signal goes away. I have a friend who is taking some college courses that require essays, and she's explicitly told that Grammarly must not flag the essay as AI-written. Unfortunately (and somewhat amusingly), the detector sucks, and her normal writing is flagged as AI-written all the time - she has to rewrite it in a more awkward manner to get the metric below the threshold. Similarly, I imagine any given GPT detector could be defeated by just hooking it up to the GPT in a negative-feedback loop.
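Something like this sketch - detector_score is a hypothetical stand-in for whatever detector you're trying to defeat, since none of them expose a standard API, and the model name is just a placeholder:

    from openai import OpenAI

    client = OpenAI()

    def detector_score(text: str) -> float:
        """Hypothetical stand-in: probability in [0, 1] that `text` is AI-written."""
        raise NotImplementedError

    def launder(text: str, threshold: float = 0.2, max_rounds: int = 10) -> str:
        # Negative-feedback loop: regenerate until the detector stops firing.
        for _ in range(max_rounds):
            if detector_score(text) < threshold:
                return text
            resp = client.chat.completions.create(
                model="gpt-4o",  # placeholder model name
                messages=[{
                    "role": "user",
                    "content": "Rewrite this so it reads less like AI-generated text, "
                               "preserving the meaning:\n\n" + text,
                }],
            )
            text = resp.choices[0].message.content
        return text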

And interestingly, one of Douglas Adams' last projects was Starship Titanic, which was arguably the first game to have "chatbot" NPCs - in 1998! Obviously they didn't work very well, but the game's ambition was very ahead of its time.

Not my fault @SubstantialFrivolity chose to set the bar this low in his claims. An existence proof is all I need. But hey, you are fully free to replace that sarcasm with your example of how deficient ChatGPT/Claude is. Evidence is trivially easy to procure here!

Here. I picked a random easyish task I could test. It's the only prompt I tried, and ChatGPT succeeded zero-shot. (Amusingly, though I used o3, you can see from the thought process that it considered this task too simple to even need to execute the script itself, and it was right.) The code's clear, well-commented, avoids duplication, and handles multiple error cases I didn't mention. A lot of interviewees I've encountered wouldn't do nearly this well - alas, a lot of coders are not good at coding.

Ball's in your court. Tell me what is wrong with my example, or that you can do this yourself in 8 seconds. When you say "like a few lines", is that some nonstandard usage of "few" that goes up to 100?

Even better, show us a (non-proprietary, ofc) example of a task where it just "makes shit up" and provides "syntactically invalid code". With LLMs, you can show your receipts! I'm actually genuinely curious, since I haven't caught ChatGPT hallucinating code since before 4o. I'd love to know under what circumstances it still does.

You keep banging this drum. It's so divorced from the world I observe that I honestly don't know what to make of it. I know you've already declared that the Google engineers checking in LLM code are bad at their job. But you are at least aware that there are a lot of objective coding benchmarks out there, which have seen monumental progress over the last three years, right? You can't be completely insulated from being swayed by real-world data, or why are you even on this forum? And, just for your own sake, why not try to figure out why so many of us are having great success using LLMs, while you aren't? Maybe you're not using the right model, or are asking it to do too much (like write a large project from scratch).

I've had the "modern AI is mind-blowing" argument quite a few times here (I see you participated in this one), and I'm not really in a good state to argue cogently right now. But you did ask nicely, so I'll offer more of my perspective.

LLMs have their problems: You can get them to say stupidly wrong things sometimes. They "hallucinate" (a term I consider inaccurate, but it's stuck). They have no sense of embodied physics. The multimodal ones can't really "see" images the way we do. Mind you, just saying "gotcha" for things we're good at and they're not cuts both ways. I can't multiply 6 digit numbers in my head. Most humans can't even spell "definately" right.

But the one thing that LLMs really excel at? They genuinely comprehend language. To mirror what you said, I "do not understand" how people can have a full conversation with a modern chatbot and still think it's just parroting digested text. (It makes me suspect that many people here, um, don't try things for themselves.) You can't fake comprehension for long; real-world conversations are too rich to shortcut with statistical tricks. If I mention "Freddie Mercury teaching a class of narwhals to sing", it doesn't reply "ERROR. CONCEPT NOT FOUND." Instead there is some pattern in its billion-dimensional space that somehow fuzzily represents and works with that new concept, just like in my brain.

That already strikes me as a rather General form of Intelligence! LLMs are so much more flexible than any kind of AI we've had before. Stockfish is great at Chess. AlphaGo is great at Go. Claude is bad at Pokemon. And yet, the vital difference is that there is some feature in Claude's brain that knows it's playing Pokemon. (Important note: I'm not suggesting Claude is conscious. It almost certainly isn't.) There's work to do to scale that up to economically useful jobs (and beating the Elite Four), but it's mainly "hone this existing tool" work, not "discover a new fundamental kind of intelligence" work.