official_techsupport

I don't see how it could possibly generate subtitles instantly, on the fly, for a music video with a three-minute runtime.
It certainly has access to the subtitles, so they are probably cached at least. That video has like 10 million views, so even though I doubt anyone before me ever questioned the number of syllables in "gambeler", it definitely could have been pre-subtitled.
Also, listening to the track, it seems like a pretty good example of the pronunciation you are referring to -- so it's clearly not 'listening' to the video in any meaningful way.
Thank you!
Yeah, after showing this to people and thinking about it, I lean heavily towards Grok having been fed a bunch of autogenerated subtitles (with timestamps). Which is very cool, but not nearly as cool as if it actually listened to stuff. And on top of that, it keeps hallucinating things.
As an aside, surely you must have more intelligent things to do with your time than arguing with chatbots.
Like reading replies like this? Arguing with a chatbot was more productive tbhwyf.
Grok 3.0 apparently can listen to music and integrate the results with the rest of its knowledge.
Twitter released the newest iteration of its AI; it's fun, clever, and noticeably less lobotomized than basically everything before it. It's currently free with a basic Twitter account. But I accidentally discovered a thing that kinda blew my mind. I got into a silly argument with it, as one does, about whether you can legitimately pronounce "gambler" with three syllables. As one of my arguments I brought up a Johnny Cash song where, to my ear, he does it. The robot responded:
I disagree obviously, but notice the timestamp!!! I'm reasonably sure that nobody in the history of the internet had this exact argument before and mentioned the exact timestamp in that exact song. Moreover, before that I asked it about "House of the Rising Sun" (because I misremembered the vocalist drawling "gambling" there), and the robot also timestamped the place in the recording where it was said.
So I don't know. It's possible that this is the result of an unsophisticated hack: give the AI a database of timestamped subtitles for YouTube videos (something they have already generated), and it then bullshits its way through the argument about what was actually said and how. That's totally possible; it's really good at bullshitting!
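For what it's worth, that hypothesized hack wouldn't require anything exotic. Here's a minimal sketch in Python of what the lookup side could look like, with the cue text, timestamps, and structure all invented purely for illustration -- nothing here reflects Grok's actual internals or YouTube's caption format:

```python
# Hypothetical sketch: answering "where does he say X?" from cached,
# timestamped subtitle cues, without ever touching the audio.
# All cue data below is invented for illustration.

from dataclasses import dataclass

@dataclass
class Cue:
    start: float  # seconds into the video
    end: float
    text: str

# Pretend these were pre-fetched from autogenerated captions
# for a popular video and cached.
cues = [
    Cue(12.0, 15.5, "walked out into the darkness"),
    Cue(73.0, 77.0, "every gambler knows the secret"),
]

def find_word(cues, word):
    """Return (timestamp, cue text) for every cue containing the word."""
    word = word.lower()
    return [(c.start, c.text) for c in cues if word in c.text.lower()]

for ts, text in find_word(cues, "gambler"):
    mins, secs = divmod(int(ts), 60)
    print(f"{mins}:{secs:02d}  {text!r}")  # e.g. 1:13  'every gambler knows the secret'
```

With something like this feeding the context window, the model can cite an exact timestamp while knowing nothing about how the word was actually sung -- which is consistent with it confidently timestamping a line and still getting the pronunciation wrong.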
The other possibility is that it actually listens to videos/audio and analyzes them, on the fly or during training, whatever. What's super interesting about this is that, look, we started with LLMs that literally had not a single real-world referent, nothing that could remotely qualify as a quale of, say, seeing an apple. They were trained entirely on people talking about their perceptions of apples, and somehow they managed to learn what apples are pretty well without ever seeing one (which all philosophers agreed should be impossible, since seeing apples must come first, and yet here we were). And now, if it's not just a subtitle hack, we have quietly passed another milestone: the robots can now hear and see and correlate that with their knowledge bases.
Also, I asked the robot directly:
(timestamped subtitles followed)
Idk, it responded pretty much instantly, so it could be lying. Or maybe it has preprocessed subtitles for popular videos.