Well, as a general rule, I don't do business with people who lie to me.
You've got a business, and you sent me junk mail, but you made it look like some official government thing to get me to open it? I'm done, just because you lied on the envelope. I don't care how badly I need your service. There's a dozen other places that can provide it; I'll pick one of them rather than you, because you've shown yourself to be dishonest right out of the gate.
Same thing with an AI (or a business that creates an AI). You're willing to lie about who you are (or have your tool do so)? What else are you willing to lie to me about? I don't have time in my life for that. I'm out right here.
Out of curiosity, given two code submissions that are completely identical—one written solely by a human and one assisted by AI—why should its provenance make any difference to you? Is it like fine art, where it’s important that Picasso’s hand drew it? Or is it like an instruction manual, where the author is unimportant?
Similarly, would you consider it to be dishonest if my human colleague reviewed and made changes to my code, but I didn’t explicitly credit them?
Why does the provenance make any difference? Let me increase your options. Option 1: You completely hand-wrote it. Option 2: You were assisted by an AI, but you carefully reviewed it. Option 3: You were assisted by an AI (or the AI wrote the whole thing), and you just said, "looks good, YOLO".
Even if the code is line-for-line identical, the difference is in how much trust I am willing to give the code. If I have to work in the neighborhood of that code, I need to know what degree of skepticism I should be viewing it with.
That's the thing. As someone evaluating pull requests, should you trust the code based on its provenance, or should you trust it based on its content? Automated testing can validate code, but it can't validate people.
ISTM the most efficient and objective solution is to invest in AI more on both sides of the fence.
In the future, that may be fine. We're not in that future yet. We're still at a place where I don't fully trust AI-only code to be as solid as code that is at least thoroughly reviewed by a knowledgeable human.
(Yes, I put "AI-only" and "knowledgeable" in there as weasel words. But I think that with them, it is not currently a very controversial case.)
As an attorney, I know copyright law. (This is not legal advice.) There's nothing about copyright law that says you have to credit an AI coding agent for contributing to your work. The person receiving the code has to perform their due diligence in any case to determine whether the author owns it or has permission from the owner to contribute it.
Can you back this up with legal precedent? To my knowledge, nothing of the sort has been ruled by the courts.
Additionally, this raises another big issue. A few years ago, a couple of guys used software (what you could argue was a primitive AI) to generate around 70 billion unique pieces of music, which amounts to essentially every piece of copyrightable music using standard music scales.
Is the fact that they used software to develop this copyrighted material relevant? If not, then their copyright should certainly be legal and every new song should pay them royalties.
It seems that using a computer to generate results MUST be added as an additional bit of analysis in infringement and fair-use cases, if not a more fundamental acknowledgement that computer-generated content falls under a different category. (I'd imagine the real argument would be over how much of the input was human vs. how much was the system.)
Of course, this all sets aside the training of AI using copyrighted works. As it turns out, AI can regurgitate verbatim large sections of copyrighted works (up to 80% according to this study[0]) showing that they are in point of fact outright infringing on those copyrights. Do we blow up current AI to maintain the illusion of copyright or blow up current copyright law to preserve AI?
You're asking a lot of very good and thoughtful questions, but none are directly related to the immediate issue, which is "do I have to credit the AI model?".
You are spamming the whole fucking thread with the same nonsense. It is instructed to hide that the PR was made via Claude Code. I don't know why people who are so AI forward like yourself have such a problem with telling people that they use AI for coding/writing, it's a weirdly insecure look.
What's insecure about it? If it is up to the institution to make that decision, you can still do it. Claude is not stopping you from making that decision.
Hello, the part about canonical filtering in https://openreview.net/pdf?id=DFybOGeGDS doesn't seem to try to account for pretokenization. For example, if you receive " 天天中彩票APP" in o200k, it means there has to be a lowercase letter within the span of letters, and while tokens like (4 spaces) may be pairwise compatible with tokens like "123" according to the BPE merge rules, the pretokenizer would split the span of spaces to give (3 spaces), " ", "123" instead. Are you aware of any work that does actual canonical generation for models with this kind of pretokenization regex?
If the immediate next token probabilities are flat, that would mean the LLM is not able to predict the next token with any certainty. This might happen if an LLM is thrown off by out of distribution data, though I haven't personally seen it happen with modern models, so it was mostly a sanity check. But examples from the past that would cause this have been simple things like not normalizing token boundaries in your input, trailing whitespace, etc. And sometimes using very rare tokens AKA "glitch tokens" (https://en.wikipedia.org/wiki/Glitch_token).
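For anyone who wants to run that sanity check themselves, here is a minimal sketch of one way to measure "flatness": compute the entropy of the next-token distribution and compare it to the maximum possible entropy for that vocabulary size. The function names and the 0.9 threshold are my own assumptions for illustration, not anything from a particular library, and the logits are assumed to be a plain list of floats for a single position.

```python
import math


def next_token_entropy(logits):
    """Return (entropy in nats, entropy normalized to [0, 1]) for one position.

    `logits` is assumed to be a plain list of raw scores, one per vocabulary entry.
    """
    # Softmax with the max subtracted for numerical stability.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    # A uniform distribution over the vocabulary has entropy log(V).
    max_entropy = math.log(len(logits))
    normalized = entropy / max_entropy if max_entropy > 0 else 0.0
    return entropy, normalized


def looks_flat(logits, threshold=0.9):
    """Hypothetical check: treat the distribution as flat above an arbitrary threshold."""
    _, normalized = next_token_entropy(logits)
    return normalized >= threshold
```

A normalized value near 1 means the model is spreading probability almost evenly over the vocabulary; a value near 0 means it is confident about one token. Where to put the threshold is a judgment call.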
Hello, a couple years ago I participated in a contest to count word frequencies and generate a sorted histogram. There's a cool post about it featuring a video discussing the tricks used by some participants. https://easyperf.net/blog/2022/05/28/Performance-analysis-an...
Some other participants said that they measured zero difference in runtime between pshufb+eq and eqx3+orx2, but I think your problem has more classes of whitespace. For the histogram problem, considerations about how to hash all the words in a chunk of the input dominate considerations about how to obtain the bitmasks of word-start or word-end positions.
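To make the two pieces concrete, here is a scalar sketch in Python of what the bitmask step and the histogram step compute. This is not the SIMD code from the contest: the whitespace set, the function names, and the tie-breaking order in the sort are all my assumptions, and a real entry would build the mask with vector compares rather than a byte loop.

```python
from collections import Counter

# Assumed whitespace classes (ASCII only); the contest rules may have differed.
WHITESPACE = frozenset(b" \t\n\r\x0b\x0c")


def word_start_mask(chunk: bytes, prev_was_space: bool = True) -> int:
    """Return an int whose bit i is set when chunk[i] begins a word.

    `prev_was_space` carries the state of the byte just before this chunk.
    """
    non_ws = 0
    for i, byte in enumerate(chunk):
        if byte not in WHITESPACE:
            non_ws |= 1 << i
    # Bit i of prev_non_ws says whether byte i-1 was non-whitespace.
    prev_non_ws = (non_ws << 1) | (0 if prev_was_space else 1)
    # A word starts where this byte is non-whitespace and the previous one is not.
    return non_ws & ~prev_non_ws


def sorted_histogram(text: bytes):
    """Count words and sort by descending count, then by word (assumed tie-break)."""
    counts = Counter(text.split())  # bytes.split() splits on ASCII whitespace
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
```

For example, `word_start_mask(b"ab  cd e")` sets bits 0, 4, and 7. The hashing concern shows up in `sorted_histogram`: the `Counter` update does one hash lookup per word, which is where the time goes once the masks are cheap.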
requires fully deterministic inference, which turns out to be unusual, but for this sort of thing it's probably fine if you do really slow inference on cpu. cool idea.