A training dataset is a document, not a method of processing a document. This type of document regularly gets reproduced and distributed in a commercial environment. Even if the distribution is contained within a large corporation, it still counts as distribution. Should that be allowed within the scope of copyright law? This seems like a legitimate question.
Google's AI summaries are actively harming quite a lot of people. They're regularly filled with misinformation, but they're presented as facts, complete with references. Many people do not understand the limitations of this technology, and simply believe what they're presented with.
I'm not convinced that Google understands the limitations, to be honest. The most charitable interpretation I can give of their motivations is that they're terrified of competition from OpenAI, and are trying to present an alternative. Unfortunately, they're presenting a woefully inadequate product.
It goes further though, into legitimate questions of copyright, which the tech industry has always fought against. (Take first, deal with it later is the MO.)
Some schools are $100k/year for room, board, and tuition, and yet those expensive schools are very much optional. It's a red herring to point them out.
There are still affordable schools. And staying in a dorm with expensive room and board remains optional at many institutions. Heck, some people still live with their parents.
The state school I went to is still just around $10k/year tuition, and I got a broad education that opened many doors for me. (I was in the humanities, but there are very good science programs there as well.)
Of course, for most people it's crazy to sink $400k into a degree. And for many, many people, it is completely unnecessary! You can still get a relatively affordable 4-year degree.
Yeah, all of the above was a single bug in the plot allocation code: the exception that handled the transaction rollback had the wrong name. It's working again.
So much of what makes people willing to be moved by creative art is the willingness to believe they're investing in someone else's real thoughts & effort -- and opening themselves to a channel of real human connection & relationship.
AI has raised the bar, in terms of making it more difficult to create the trust necessary for people to be willing to open themselves up to that connection.
Automatic captioning has been transformative, in terms of accessibility, and seems to be something people universally want. Most people don't think of it as AI though, even when it is LLM software creating the captions. There are many more ways that AI tools could be embedded "invisibly" into our day-to-day lives, and I expect they will be.
To be clear, it's not LLMs creating the captions. Whisper[0], one of the best of its kind currently, is a speech recognition model, not a large language model. It's trained on audio, not text, and it can run on your mobile phone.
It's still AI, of course. But there is a distinction between it and an LLM.
It’s an encoder-decoder transformer trained on audio (language?) and transcription.
Seems kinda weird for it not to meet the definition in a tautological way even if it’s not the typical sense or doesn’t tend to be used for autoregressive token generation?
Whisper is an encoder-decoder transformer. The input is audio spectrograms, the output is text tokens. It is an improvement over old-school transcription methods because it’s trained on audio transcripts, so it makes contextually plausible predictions.
Idk what the definition of an LLM is but it’s indisputable that the technology behind Whisper is a close cousin to text decoders like GPT. Imo the more important question is how these things are used in the UX. Decoders don’t have to be annoying, that is a product choice.
Do you have an example of a good implementation of AI captions? I've only experienced those on YouTube, and they are really bad. The automatic dubbing is even worse, but still.
On second thought this probably depends on the caption language.
I'm not going to defend the YouTube captions as good, but even still, I find them incredibly helpful. My hearing is fine, but my processing is rubbish, and having a visual aid to help contextualize the sound is a big help, even when they're a bit wrong.
Your point about the caption language is probably right though. It's worse with jargon or proper names, and worse with non-American English speakers. If they don't even get all the common accents of English right, I have little hope for other languages.
Automatic translation famously fails catastrophically with Japanese, because it's a language that heavily depends on implied rather than explicit context.
The minimal grammatically correct sentence is simply a verb, and it's an exercise for the reader to know what the subject and object are expected to be. (Essentially, the more formal/polite you get, the more things are added. You could say "kore wa atsui desu" to mean "this is hot." But you could also just say "atsui," which could also be interpreted as a question instead of a statement.)
Chinese seems to have similar issues, but I know less about how it's structured.
Anyway, it's really nice when Japanese music on YouTube includes a human-provided translation as captions. Automated ones are useless, when the system doesn't give up entirely.
I assume people are talking about transcription, not translation. Translation on YouTube, IME, is indeed horrible in all the languages I have tried, but transcription in English is good enough to be useful. However, the more technical jargon a video uses, the worse the transcription is (translation is totally useless for anything technical there).
Automatic transcription in English depends heavily on accent, sound quality, and how well the speaker is articulating. It will often mistake words that sound alike and produce nonsensical sentences, randomly skip words, or just insert random words for no clear reason.
It does seem to do a few clever things. For lyrics it seems to first look for existing transcribed lyrics before making its own guesses (the timing, however, can be quite bad when it does this). Outside of that, AI transcription of videos is like an alien who has read a book on a dead language and is transcribing based on what the book says the words should sound like phonetically. At times that can be good enough.
(A note on sound quality: it's not the perceived quality that matters. Many low-res videos have perfectly acceptable, if somewhat lossy, sound quality, but the transcriber goes insane. It seems to prefer 1080p videos, with what I assume is a much higher bit-rate for the sound.)
At the times I have noticed the transcription being bad, my own speech comprehension is even worse. So I still find it useful. It is not a substitute for human-created (or at least curated) subtitles by any means, but it's better than nothing.
Do you have an example? YT captions being useless is a common trope I keep seeing on Reddit that is not reflected in my experience at all. Feels like another "omg so bad" hyperbole that people just dogpile on, but would love to be proven wrong.
There are projects that will run Whisper or another transcription service locally on your computer, and the quality is great. For whatever reason, Google chooses not to use their highest quality transcription models on YouTube, maybe due to cost.
I use Whisper running locally for automated transcription of many hours of audio on a daily basis.
For the most part, Whisper does much better than stuff I've tried in the past like Vosk. That said, it makes a somewhat annoying error that I never really experienced with others.
When the audio is low quality for a moment, it might misinterpret a word. That's fine, any speech recognition system will do that. The problem with Whisper is that the misinterpreted word can affect the next word, or several words. It's trying to align the next bits of audio syntactically with the mistaken word.
With older systems, you'd get a nonsense word where the noise was but the rest of the transcription would be unaffected. With Whisper, you may get a series of words that completely diverges from the audio. I can look at the start of the divergence and recognize the phonetic similarity that created the initial error. The following words may not be phonetically close to the audio at all.
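To make that failure mode concrete, here is a toy sketch of the difference. It is not Whisper's actual decoder: the words, the "acoustic" scores, and the bigram bonus table are all made up for illustration. One decoder picks the best acoustic match at each step independently; the other also conditions on the word it emitted just before, which is roughly what an autoregressive decoder does.

```python
# Toy illustration only: this is NOT Whisper's decoder. All words, scores,
# and the bigram bonus table below are invented for the example.

# Per-step "acoustic" scores: how well each candidate word matches the audio.
FRAMES = [
    {"the": 0.9},
    {"cat": 0.45, "cap": 0.55},  # noisy moment: the wrong word scores higher
    {"sat": 0.6, "fits": 0.4},
    {"down": 0.6, "well": 0.4},
]

# Hypothetical language-model bonus for a word given the previous output word.
BIGRAM_BONUS = {
    ("cat", "sat"): 0.5,
    ("sat", "down"): 0.5,
    ("cap", "fits"): 0.5,
    ("fits", "well"): 0.5,
}


def decode_independent(frames):
    """Pick the best acoustic match at each step, ignoring earlier output."""
    return [max(scores, key=scores.get) for scores in frames]


def decode_autoregressive(frames, bigram_bonus):
    """Greedy decoding where each choice is conditioned on the previous word."""
    output = []
    for scores in frames:
        prev = output[-1] if output else None
        best = max(
            scores,
            key=lambda w: scores[w] + bigram_bonus.get((prev, w), 0.0),
        )
        output.append(best)
    return output


if __name__ == "__main__":
    print(" ".join(decode_independent(FRAMES)))                   # the cap sat down
    print(" ".join(decode_autoregressive(FRAMES, BIGRAM_BONUS)))  # the cap fits well
```

Both decoders mishear "cat" as "cap" at the noisy step. The independent one recovers immediately, while the conditioned one keeps choosing words that fit the mistaken word better than they fit the audio.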
Ah yes, one of the standard replies whenever anyone mentions a way that an AI thing fails: "You're still using [X]? Well of course, that's not state of the art, you should be using [Y]."
You don't actually state whether you believe Parakeet is susceptible to the same class of mistakes...
It's an extremely common goalpost-moving pattern on HN, and it adds little to the conversation without actually addressing how or whether the outcome would be better.
Try it, or don't. Due to the nature of generative AI, what might be an issue for me might not be an issue for you, especially if we have differing use cases, so no one can give you the answer you seek except for yourself.
I doubt that people prefer automatic captions over human-made ones, any more than people prefer AI subtitles. The big AI subtitle controversy going on right now in anime demonstrates well that quite a lot is lost in translation when an AI is guessing what words are most likely in a situation, compared to a human making a translation.
What people want is something that is better than nothing, and in that sense I can see how automatic captions are transformative in terms of accessibility.
Not every college has crazy tuition. The school I attended from 2000 to 2004 has kept pace with inflation generally. Annual tuition is now around $10k, which is a lot, but not unmanageable for many middle class families. I'm curious how this compares across universities throughout the U.S. Maybe the tuition story has bifurcated somewhat?
Tuition for my college in 2012-2016 was around 6k per year. A quick Google shows me this increased to 11k.
And it's worse than it looks, because this doesn't include cost of materials, the dreaded "other fees", and of course, room and board. Room and board increased 50%.
And this used to be considered a "best value" college. I'm sure it's only worse for private schools in the state.
The state school I went to 20+ years ago, by contrast, has around $10k in annual tuition, which isn't bad compared to a trade school. No mandatory housing/food costs either. I got a great education there and am still friends with some of my profs. I also got one of the least practical (for most people) degrees (creative writing), and turned it into a comfortable job for myself, though I recognize that's the exception and not the rule.
I never thought of university as a way to get a job. It certainly did help me in many, many ways though, and I can't imagine having my current career without it.
The solution is simple -- switch to another device!
Our minds are hard-wired to build habits via physical association. Having a single-purpose device very much fits with how our minds work. If we want to do research, then go to a research-enabled device. If we want to focus on writing, then open the writing-focused device.