official_techsupport

I don't see how it could possibly generate subtitles instantly, on the fly, for a music video with a three-minute runtime.
It certainly has access to the subtitles, so they are probably cached at least. That video has like 10 million views, so even though I doubt anyone before me ever questioned the number of syllables in "gambeler", it definitely could have been pre-subtitled.
Also, listening to the track, it seems like a pretty good example of the pronunciation you are referring to -- so it's clearly not 'listening' to the video in any meaningful way.
Thank you!
Yeah, after showing this to people and thinking about it, I lean heavily towards Grok having been fed a bunch of autogenerated subtitles (with timestamps). Which is very cool, but not nearly as cool as if it actually listened to stuff. And on top of that, it keeps hallucinating things.
As an aside, surely you must have more intelligent things to do with your time than arguing with chatbots.
Like reading replies like this? Arguing with a chatbot was more productive tbhwyf.
Grok 3.0 apparently can listen to music and integrate the results with the rest of its knowledge.
Twitter released the newest iteration of its AI; it's fun, clever, and noticeably less lobotomized than basically everything before it. It's currently free with a basic Twitter account. But I accidentally discovered a thing that kinda blew my mind. I got into a silly argument with it, as one does, about whether you can legitimately pronounce "gambler" with three syllables. As one of my arguments I brought up a Johnny Cash song where, to my ear, he does it. The robot responded:
I disagree obviously, but notice the timestamp!!! I'm reasonably sure that nobody in the history of the internet had this exact argument before and mentioned the exact timestamp in that exact song. Moreover, before that I asked it about "House of the Rising Sun" (because I misremembered the vocalist drawling "gambling" there), and the robot also timestamped the place in the recording where it was said.
So I don't know. It's possible that this is the result of an unsophisticated hack: give the AI a database of timestamped subtitles for YouTube videos (something they have already generated), and it then bullshits its way through the argument about what was actually said and how. That's totally possible; it's really good at bullshitting!
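For what it's worth, that hypothesized hack wouldn't require anything exotic. Here's a minimal sketch in Python of what the lookup side could look like, with the cue text, timestamps, and structure all invented purely for illustration -- nothing here reflects Grok's actual internals or YouTube's caption format:

```python
# Hypothetical sketch: answering "where does he say X?" from cached,
# timestamped subtitle cues, without ever touching the audio.
# All cue data below is invented for illustration.

from dataclasses import dataclass

@dataclass
class Cue:
    start: float  # seconds into the video
    end: float
    text: str

# Pretend these were pre-fetched from autogenerated captions
# for a popular video and cached.
cues = [
    Cue(12.0, 15.5, "walked out into the darkness"),
    Cue(73.0, 77.0, "every gambler knows the secret"),
]

def find_word(cues, word):
    """Return (timestamp, cue text) for every cue containing the word."""
    word = word.lower()
    return [(c.start, c.text) for c in cues if word in c.text.lower()]

for ts, text in find_word(cues, "gambler"):
    mins, secs = divmod(int(ts), 60)
    print(f"{mins}:{secs:02d}  {text!r}")  # e.g. 1:13  'every gambler knows the secret'
```

With something like this feeding the context window, the model can cite an exact timestamp while knowing nothing about how the word was actually sung -- which is consistent with it confidently timestamping a line and still getting the pronunciation wrong.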
The other possibility is that it actually listens to videos/audio and analyzes them, on the fly or during training, whatever. What's super interesting about this is that, look, we started with LLMs that literally had not a single real-world referent, nothing that could remotely qualify as a quale of, say, seeing an apple. They were trained entirely on people talking about their perceptions of apples, and somehow they managed to learn what apples are pretty well without ever seeing one (which all philosophers agreed should be impossible, since seeing apples must come first, and yet here we were). And now, if it's not just a subtitle hack, we have quietly passed another milestone: the robots can now hear and see and correlate that with their knowledge bases.
Also, I asked the robot directly:
(timestamped subtitles followed)
Idk, it responded pretty much instantly, so it could be lying. Or maybe it has preprocessed subtitles for popular videos.