now I'm a bit miffed about the exorbitant rent.
The Invisible Hand doesn't sleep.
I also argued that implementing this feature wouldn't require Grok developers to do anything special
This part is wrong -- Grok is designed to take text input, and the developers would definitely need to 'do something' for it to ingest youtube audio instead. (and further work would be required for the model to make any judgement as to how Cash was pronouncing "gambler")
My purpose here is to clear your misconceptions as to how the technology works, and what is possible.
I never said that it was impossible to build a near-instantaneous transcription service -- I said that Grok has no reason to do so, and therefore almost certainly didn't.
youtube will serve all necessary files at the same time
I don't think it will -- have you interacted with the youtube API at all?
I'm guessing you never had to deal with business decisions before.
That's a really bad guess -- I've even had to deal with people who don't want to (and/or seem unlikely to be able to!) pay my business for services rendered.
I'm not asking about whether you think it was a good idea to deliver the coal or not -- I'm saying that, legally speaking, the (alleged) buyer was seeking to enforce a contract for ongoing delivery that they appear to have already violated by not paying the late fees that were laid out in the contract. My understanding of these situations is that once a party fails to fulfill some part of their responsibilities under a contract, the counterparty is no longer bound by the rest of it.
Now they have to find someone else to buy 12,000 tons of metallurgical coal per week
I thought the point of the case was that the coal company didn't want to deliver the coal, and the buyer wanted it -- the business decision seems to have been made?
Sounds are not text though -- nothing is free, and nothing is instant.
Why don't you try it? Ask Grok to transcribe a song from a youtube link and see what it does -- preferably a song that differs from the published lyrics somehow, maybe a live version or something.
I don't think so -- you are wrong on this one.
OK, but will Grok? I guess it would be pretty easy to try, but it might refuse on copyright grounds or something.
whisper_print_timings: total time = 67538.11 ms
OK? 67 seconds is not instant -- like, at all. Even 6.7s (assuming the resources assigned to this task were as you suggest) is not instant.
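The arithmetic behind those two figures can be sketched in a few lines of Python. Only the timing line comes from the quoted log; the regex, the helper name `total_seconds`, and the 10x speedup factor are assumptions made for illustration.

```python
import re


def total_seconds(timing_line: str) -> float:
    """Pull the 'total time' field (milliseconds) out of a timing line and return seconds."""
    match = re.search(r"total time\s*=\s*([\d.]+)\s*ms", timing_line)
    if match is None:
        raise ValueError("no 'total time = ... ms' field found")
    return float(match.group(1)) / 1000.0


# The line quoted above.
line = "whisper_print_timings: total time = 67538.11 ms"
measured = total_seconds(line)

# Hypothetical: assume the hosted service has 10x the resources of the test machine.
assumed_speedup = 10.0
scaled = measured / assumed_speedup

print(f"measured: {measured:.1f} s, scaled by {assumed_speedup:g}x: {scaled:.2f} s")
```

Even with the generous speedup assumption, the scaled figure stays in whole seconds rather than the sub-second range an "instant" reply would need.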
I'm not arguing that Grok 3.0 does in fact do all of this with the Johnny Cash song. All I'm saying is that it could.
Of course it could! But it doesn't, and the fact that it responded instantly is evidence of that. Do you really think Grok is spending resources (mostly dev time, really) to add features allowing the model to answer weird questions about song lyrics?
LLMs lie man -- we should get used to it I guess.
Twitter's not technically FAANG, but I think they need to compete with those salaries -- for which (especially in the Bay Area) $300K is nowhere near top-end.
Stock grant of that much again would also be nothing special for somebody at all in demand -- so $.5-1M TC sounds about right.
LLMs were developed as tools to automatically generate transcripts and sub-titles
Interesting assertion, but it doesn't really have any bearing on whether or not Grok can do this -- it takes text input from the user, and generates a text response. What makes you think it even has an interface to bring in audio inputs? (on the training end, they might -- given the hunger for data -- but it seems like an odd thing to include in a chatbot. Even for training, it would probably be better to do something like, oh, IDK -- run a transcription algo on as much YouTube content as you can grab and then feed the text from that into your training set. You might even include some timestamps!)
Yes, but serving and parsing videos from youtube is not one of those things.
Warren had never once paid "on time" but had waited until the last minute and withheld the late fee.
How come they hadn't repudiated the contract if they didn't pay the late fees?
IDK, some of those Joe Biden "get in, loser" memes were pretty funny.
LLMs != AI.
Agreed!
(that means that there is no AI at all though -- and the sheer effort/$ being devoted to LLMs is if anything making it less likely that there will be anytime soon.)
Yes, and for the LLM to parse these bits, first youtube needs to locate them, then serve them to the llm. If the llm can convince youtube to serve the bits as fast as bandwidth will allow, it still needs to run those bits through some transcription algo -- which typically are borderline on lagging at 1x speed.
In the instant case, it would also need that algo to make some sort of judgement on the accent with which some of the words are being pronounced -- which is not a thing that I've seen. The fact that it goes ahead and gets this wrong (Cash pretty clearly says gam-bel-er in the video) makes it much more likely that the llm is looking at some timestamped transcript to pick up the word "gambler" in the context of country songs, and hallucinating a pronunciation.
This would be more convincing if humanoid robots existed -- or llms were able to control them. If you ask an LLM "how do you break down a chicken?" it will probably give you a pretty good description that a human could follow -- this sort of thing is well represented in its training set. If you ask it for a program to activate the servos of a hypothetical knife-wielding humanoid robot such that a chicken in front of it will be disassembled, it will give you utter trash. (if it doesn't demur)
It's a pretty good example of the difference between an intelligence and a language model actually -- a language model can describe things, and an AI can do things.
All that to say, if you want your chicken factory automated, waiting for a humanoid robot so you can drop it into place is not a very effective approach. Buying some machines from the Dutch would work much better.
True (and interesting about the Chrome extension; what is the use case for 10x browser playback of youtube videos, I wonder?) but I'm quite sure Grok is not currently programmed with anything like this.
I do understand this -- just the same, 'parsing bits' from a video file does not happen instantly. Indeed, just starting a stream on youtube is typically not what I would call 'instant'.
OK, then you need an audio analysis model -- this is not a thing that is integrated into LLMs.
The Dutch company video somebody linked downthread shows it done with rotating knives and alignment guides, not robots at all -- which seems to work, and is not at all generalist.
The approach I'm imagining involves laying the carcass out flat on a cutting board, holding it with the robot hooks, and slicing off limbs based on the location of the joints as determined by AI(tm). Probably another stage for de-breasting is needed -- or the hooks could take another bite or something.
I don't really claim that this would work well; certainly not better than the machine in the video -- but it would work better than some non-existent humanoid robot attached to a non-existent AGI.
Any of this would need a pretty specialized video analysis module though, which AFAIK doesn't really exist period, much less built into Grok -- plus the ability to download the video directly rather than look at a stream of it, which Youtube doesn't really provide. So if the AI were literally accessing the video through that link, 3:00/2x is indeed the fastest it would be able to provide the transcript.
(it would not be instant in any case; downloading the video takes X seconds, analyzing it Y -- X + Y might be less than three minutes, but it's not less than one second)
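A rough sketch of that X + Y arithmetic, in Python. Every number here is a made-up assumption (file size, bandwidth, analysis time), not a measurement of YouTube or Grok; the point is only that the sum lands between one second and three minutes under fairly generous inputs.

```python
def download_seconds(size_megabytes: float, bandwidth_mbit_per_s: float) -> float:
    """X: time to pull the file, ignoring request latency and stream start-up."""
    return size_megabytes * 8.0 / bandwidth_mbit_per_s


def transcript_latency(download_s: float, analysis_s: float) -> float:
    """X + Y: the two steps run back to back, so their times add."""
    return download_s + analysis_s


# Hypothetical inputs: a 5 MB audio track, a 100 Mbit/s link,
# and about 6.75 s of transcription time for a three-minute song.
x = download_seconds(size_megabytes=5.0, bandwidth_mbit_per_s=100.0)
y = 6.75
total = transcript_latency(x, y)

print(f"X = {x:.2f} s, Y = {y:.2f} s, X + Y = {total:.2f} s")
print("under three minutes:", total < 180.0)
print("under one second:", total < 1.0)
```

Overlapping the download with the analysis would shave something off the total, but the analysis term alone already exceeds a second under these assumptions.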
A humanoid robot is not the right tool for the job though -- what you want is a machine with sharp knives matching the number of joints on a chicken mounted to some kind of press, plus several hooks that can grab the carcass and align it appropriately. (the knives probably need to self-adjust too, depending on the size-consistency of your chickens)
Machine vision probably helps with this some, but as others have said "object segmentation" was a pretty solved problem years ago -- and there's no AI anywhere close to performing at the "I need you to cut this chicken apart at the joints, m'kay" level on the foreseeable horizon.
There's a reason why welding bots are not humanoid form -- humans are generalists, bots are not.
(timestamped subtitles followed)
Idk, it responded pretty much instantly, so it could be lying. Or maybe it has preprocessed subtitles for popular videos.
I don't see how it could possibly generate subtitles instantly on the fly for a music video with a runtime of three minutes? Also, listening to the track it seems like a pretty good example of the pronunciation that you are referring to -- so it's clearly not 'listening' to the video in any meaningful way.
"AI lies and confidently misrepresents evidence in order to advance its chosen position" is... not too surprising considering that it's been trained on decades of internet fora conversations, but probably not the kind of alignment we are looking for.
Chess notation has a lot of useful nuance here -- She(?), She(!), She(!?), and so on. Hard to verbalize though; maybe some crossover with the African tongue clicks that are similarly notated?