Claude 2 Internal API Client and CLI (github.com/explosion-scratch)
84 points by explosion-s on July 14, 2023 | 67 comments


Who here is using Claude? And can you comment on your experiences with it vs. GPT 3.5/4?


I spent the afternoon chatting with it one day this week and had a brilliant time. I fed it half of a book I’ve written recently, a piece of narrative and descriptive non-fiction, and its analysis was absolutely great. It digested the text and found things that even human readers have missed. What was interesting was that the book is mostly genderless, and at first it framed its analysis as if the writer were male. Then I said “the writer is actually a woman” and it not only apologised quite genuinely for getting it wrong, it altered its literary analysis and criticism in a way that was perfectly suited to a human reader knowing that the writer was female, and changed the slant of its analysis. It was deeply useful and interesting to converse with, and it found the relevant topics that an educated human reader would likely find interesting and comment on… and it did this in a few minutes, compared to a human reader, where you’d be talking weeks of latency to read and analyse the text as a complete work.

Pretty great! Bit of a party trick at the same time (it did hallucinate a couple of minor things) but enough for me as the writer to be gripped by talking to Claude. It even came up with some really interesting questions to ask me once I told it that I was the author, and many of them were better than a lot of lazy interviewers or reviewers would come up with.

Highly recommended.


I did exactly this two nights ago. I had a dead non-fiction book in a bunch of .md files which would never be opened again. The allure of the large context window led me to test it with the book, with Claude in the role of a critic. I barely slept that night. With the help of Bard for some up-to-date references, I have managed to double the word count to 40k and restructure it while vastly improving it.


I have been using GPT4 since it was available to me a couple months ago. I use it all day.

I would concur on the quality of Claude, outstanding and the context window is utterly amazing.

Within the first two days I had already modified my workflow to split work between Claude and GPT.

Claude and GPT4 are on par; Bard lags in quality or just flat-out gives up. Bard is still better than 3.5, though, and what it does have going for it is speed.

Claude does seem to be more present. My hunch is that the system prompt is massive or they spent more time fine tuning it on the assistant part of the prompt. Don’t know, but a great tool. Can’t wait for API access.


What's the token limit on it? How did you feed it "half of a book," how long is the book? Did just copy pasting verbatim work or did you have to break it up into multiple messages?


That's the thing with Claude, it's not quite the same with regards to tokens.

My book is 160,000 words, so I turned the first 60,000 words into a text file and uploaded it. Then Claude just digests it like any other message.

I did run out of chat/tokens eventually though, which was actually a bit sad


YMMV, but I’ve found that interacting with Claude conversationally gives me a much stronger impression of having a productive discussion with an individual, receiving pushback on ideas that had identifiable flaws and giving advice on how to improve my own thought processes, rather than the blind obedience that GPT-4 output is so well known for. When it comes to raw problem-solving capacity GPT-4 still handily beats it, but this is the first LLM I’ve used that makes me actually regret having to swap to GPT-4 to analyze a trickier problem.


Everyone accepts that output from LLMs is largely predicated on grounding them, but few seem to realize that grounding applies to more than raw data.

They perform better at many tasks simply by grounding their alignment in-context, by telling them very specific people to act as.

It's an example of something that "prompt engineering" solves today and that people only glancingly familiar with how LLMs work insist won't be needed soon... but by their very nature the models will always have this limitation.

Say user A is an expert with 10 years of experience and user B is a beginner with 1 year of experience: they both enter a question and all the model has to go on is the tokens in the question.

The model might have uncountable ways to reply to that question if you had inserted more tokens, but with only the question in context, you'll always get answers clustered around the mean answer it can produce... and because it's the literal mean of all those possibilities, it's unlikely that either user A or user B will find it particularly great.

Because of that there's no way to ever produce an answer that satisfies both A and B to the full capabilities of that LLM. When the input is just the question you're not even touching the tip of the iceberg of knowledge it could have distilled into a good answer. And so just as you're finding that Claude's pushback and advice is useful, someone else will say it's more finicky and frustrating than GPT 3.5.

It mostly boils down to the fact that groups of users aren't really defined by the mean. No one is the average of all developers in terms of understanding (if anything that'd make you an exceptional developer); instead people are clustered around various levels of understanding in very complex ways.

-

With that in mind, instead of banking on the alignment and training data of a given model happening to make the answer to that question good for you, you can trivially "ground" the model and tell it you're a senior developer speaking frankly with your coworker who's open to push back and realizes you might have the X/Y problem and other similar fallacies.

You can remind it that it's allowed to be unsure (or that it can be very sure), and you can even ask it to list gaps in its abilities (or yours!) that are most relevant to a useful response.

That's why hearing that model X can't do Y but model Z can doesn't really pass muster for me at this point, unless how Y was given to the model is shared.
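
As a rough illustration of the kind of grounding preamble I mean (the wording below is just an example, not a prompt I'm claiming is optimal):

    # Sketch of a grounding preamble prepended to a bare question; illustrative only.
    GROUNDING = (
        "You are talking to a senior developer with 10 years of experience who is "
        "open to pushback. Speak frankly, point out if the question looks like an "
        "X/Y problem, say when you are unsure, and list any gaps in your knowledge "
        "that are relevant to giving a useful answer."
    )

    def ground(question: str) -> str:
        """Prepend the grounding preamble so the model has more than the bare question."""
        return f"{GROUNDING}\n\n{question}"

    print(ground("Why is my service slow after adding an index?"))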


> The model might have uncountable ways to reply to that question if you had inserted more tokens, but with only the question in context, you'll always get answers clustered around the mean answer it can produce... and because it's the literal mean of all those possibilities, it's unlikely that either user A or user B will find it particularly great.

I refer to it as giving the LLM "pedagogical context" since a core part of teaching is predicting what kind of answer will actually help the audience depending on surrounding context. The question "What is multiplication?" demands a vastly different answer in an elementary school than a university set theory class.

I think that's why there's such a large variance in HNers' experience with ChatGPT. The GPT API with a custom system prompt is far more powerful than the ChatGPT interface specifically because it grounds the conversation in the way that the moderated ChatGPT system prompt can't.

The chat GUI I created for my own use has a ton of different roles that I choose based on what I'm asking. For example, when discussing cuisine I have roles like (shortened and simplified) "Julia Child talking to a layman who cares about classic technique", "expert molecular gastronomy chef teaching a culinary school student", etc.
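
For concreteness, here's a minimal sketch of passing a role like that as a system message through the (pre-1.0) openai Python client; the role wording and the question are placeholders, not my actual prompts:

    # Minimal sketch: grounding via a system message; the role text is an
    # illustrative placeholder, not a real prompt from my GUI.
    import openai

    openai.api_key = "sk-..."  # set your own key

    ROLE = ("You are Julia Child talking to a layman who cares about classic "
            "technique. Push back on flawed assumptions instead of agreeing.")

    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": ROLE},
            {"role": "user", "content": "How do I keep a pan sauce from breaking?"},
        ],
    )
    print(response["choices"][0]["message"]["content"])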


Exactly, you can’t treat these systems as a singular entity, you conjure the expert you need for the task.


We're still in the early stages of testing v2 in the real world, but it aced our suite of internal tests and we are very impressed. Claude 1.2 did OK but struggled with nuance and accuracy, whereas v2 seems to handle nuance very well and is both accurate and, most importantly, consistent. The thing with evaluating LLMs is that it's not about how well they do on your first evaluation - consistency is key, and even the slightest deviation in circumstance can throw them off, so we're being very cautious before we make the jump. GPT4 brought that consistency, but the slow speed and constant downtime make it very difficult to use in a product, so we'd love to move to Anthropic.

Our product is a tool to turn user stories into end-to-end tests so we use LLMs for NLP, identifying key parts of HTML and writing very simple code (we've not officially launched to the public just yet but for the curious, https://carbonate.dev is our product).


It's a bit less "anodyne" than GPT. GPT tends to give the most "mainstream" answer in many cases and is less "malleable" so to speak. I remember the differences between RLHF'd GPT and the original davinci GPT-3 before mode collapse. If you spent a while on a good prompt, it really paid off.

Thankfully, Claude seems to maintain this "creativity" somehow.

It's excellent at recommending books, creative writing, etc.

For coding, it's not as good as GPT-4, but still helps me more than GPT in certain coding tasks.


I've played around with Claude quite a bit, but mostly with creative writing, at which I think it is stronger than any other LLMs that I've tried, including GPT, Claude+ (which as far as I can tell has now been rebranded as Claude 2), GPT 3.5, Bard, and Bing.

I also much prefer Claude for explanations (I haven't experimented much with Claude+, but limited experiments have shown it to be even better) over the GPTs and other LLMs. It gives much more thorough and natural-sounding explanations than the competition, without extra prompting.

That said, the Claude variants don't seem to be as good at the logic-puzzly sort of stuff that most people love to test LLMs with. So if you're into that, you're probably better off with GPT4.

I also haven't tested it much with programming, but I've been very disappointed with every LLM as far as my limited testing in that realm has gone.

Claude deserves to get more attention, and I eagerly await Claude 3.


I have been using Claude for a while now. I find that in general it is on par with GPT-4. Right now I'm only using the chat interface, and I use it for content creation as I find it outperforms GPT-4 in this regard. However, GPT-4 typically does a better job with problem solving, strategy, and coding. The best approach I have found is a combination: for example, giving GPT an outline for a blog post, having Claude write it, then having GPT-4 suggest edits, and Claude adopt the recommended edits. I am also considering using the API for an aspect of our tech stack, mainly because of the 100k context limit and the absence of the restrictions I hit with GPT-4, but this is still TBD.


That feels like a great idea, mixing two strong models like these in a sequence.


Claude's training data is a year further into the future which is often beneficial. The 100k token limit is fantastic for long conversations and pasting in documents. The two downsides are 1) it seems to get confused a bit more than GPT-4 and I have to repeat instructions more often 2) the code-writing ability is definitely subpar compared to GPT-4


I used it to write unit tests. It does a lot better than GPT-4, solely because you can simply attach the files; whereas with GPT-4 I have to try to compress the code into something which fits inside the context window, or if using the API, even smaller.

The unit tests it wrote were very basic, and it still messed up in a few places. But unit tests are supposed to be basic, so IMO it did a good job.

I also like that it's a lot less wordy than GPT-4. GPT rambles and explains everything, Claude just states things or says "Let me know if ..." - most "paragraphs" are only a couple sentences.


I find the ChatGPT file attachment issue is solved with Code Interpreter enabled ( in Settings -> Beta). It even understands zipped files, like a WordPress plugin, making its effective context limit much bigger.


Using it regularly for executive feedback at some of our clients (think of this as an internal coach for policies). I'd say it's almost as good as GPT-4 at having broader conversations and sharing ideas.

The 100K model is FANTASTIC for quick prototyping as well.

Implementing everything via PhaseLLM to plug and play Claude + GPT-3.5/4 as needed. All other LLMs don't stack up to these two.


I’ve spent quite a bit of time with both, but I’m not an expert in this field so take my comments with a fist of salt.

It’s pretty good. Certainly as good as GPT-3.5 for speed and quality. Claude seems to consider the context you’ve supplied more than GPT-3.5.

Compared to GPT-4, it has similar levels of knowledge. Claude is less verbose. It’s less good at building real world models based on context. Anecdotally, I’ve found it hallucinated more than GPT.

So, it’s probably better at summarising large blocks of text, but less good at generating content that requires knowledge outside of what you’ve supplied.


The ability to upload entire documents is honestly a game-changer, even if GPT-4 is better with certain reasoning tasks. I don't think I can go back to tiny context lengths now.


I prefer it over 3.5 (I can't afford GPT4, so I'm not sure about comparisons there). It's much faster imo and refuses to respond less often. In addition, they make uploading (text-based) files easy, so although it's not truly multimodal it's still nice to use.

I also like the 100k token limit, that's insane. It almost never loses track of what you were talking about!


I looked at the pricing and it appears to be less than half the cost of GPT-4, but significantly more expensive than GPT-3.5. Does that sound correct?


Claude Instant 1.1 is a better comparison for the price/performance of GPT-3.5


I started playing with it last weekend. The 100K token limit is very useful for things like "Give me a summary of this 5-hour Lex Fridman podcast in about 10 sentences: <podcast transcript>"


I spent a couple hours pasting old ChatGPT (gpt-4) prompts into Claude to compare. I would say it's about 80-90% as good as GPT-4 for general purpose inquiries. There were some crucial misses that ChatGPT got and Claude didn't, but overall I'm impressed with it. It's just not quite as good at reasoning or providing the right kind of variety in as many cases as gpt4. I only tried one code example, and while both solutions worked, Claude's used a more appropriate construction fwiw.


It's comparable to GPT-3.x, and feature-wise it does seem to match up, so overall it's not bad.

We're using it via langchain talking to Amazon Bedrock, which is hosting Claude 1.x. The integration doesn't seem to be fully there though; I think langchain is expecting "Human:" and "AI:", but Claude uses "Assistant:".

https://github.com/hwchase17/langchain/issues/2638
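
As a rough sketch of the convention as I understand it from Anthropic's docs (the helper below is illustrative only, not langchain's or Bedrock's actual code):

    # Illustrative only: Claude's documented turn delimiters are "\n\nHuman:" and
    # "\n\nAssistant:", which is where the "AI:" assumption falls over.
    def build_claude_prompt(turns):
        """turns: list of (speaker, text) pairs, speaker is 'human' or 'assistant'."""
        prompt = ""
        for speaker, text in turns:
            label = "Human" if speaker == "human" else "Assistant"
            prompt += f"\n\n{label}: {text}"
        return prompt + "\n\nAssistant:"  # leave the final turn open for the model

    print(build_claude_prompt([("human", "Summarize this document in two sentences.")]))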


I use it for filtering and summarization tasks on huge contexts. Specifically for extracting data from raw HTML in scraping tasks. It works surprisingly well.


It's good for general things but less good at coding. Can usually get the correct answer for simpler things but much less idiomatic for python than gpt4


For javascript, it did just as well as gpt-4 for several questions and used more modern JavaScript syntax.

First time something has felt nearly as good, and the user interface is a bit nicer.


Does it use ECMAScript modules instead of CommonJS modules by default?


No, but it tends to use const more and the code looks more natural, whereas gpt-4 writes a single operation per line (though maybe that makes gpt-4 easier to follow).

For both you can prompt it to use module syntax.

I am excited about GitHub Copilot X given you’ll get the best of ChatGPT and Copilot, making the tool much more useful.


comparing them every day via https://github.com/smol-ai/menubar . i'd say when it comes to coding I pick their suggestions about 30% of the time. not SOTA, but pretty darn good!


My experience, which is fairly limited, is that it is weaker than GPT-4 (which I mostly interact with and use), but still usable. Some of it is weaker, some of it is just a different flavor of responses.

It is an AI and can help you be productive for sure.


honestly the biggest annoying thing is it seems too restricted. like it will nitpick my use of the word "think" when i ask what it thinks because "hurr durr as a LLM i don't have thoughts" yeah idc, just answer. it's also way more restricted in terms of refusing to say anything that's less than 100% anodyne. which i get the need for a clean version, just gets frustrating if e.g. i want it to add humor and the best it can do is the comedic equivalent of a knock knock joke


I love it. It may not objectively be on par with GPT-4, but uploading a 100 page document and getting a summary in seconds is nothing short of miraculous.


How do you know if it is correct or a hallucination?


Investigate, same as you would a paralegal etc. If it makes an assertion, contest it and ask where it found supporting evidence in the document for the claims made. Ask it to make the counter-argument, also with sources. Verify as needed.


That's what prompting is all about. Ask it to prove its statements. Ask it to quote passages that support its arguments. Then double and triple check it yourself. It isn't going to do the work for you, but can still be a pretty great reference tool.


Presumably one can test it with documents one has already read and knows before. If the summaries of the test documents are good, future summaries will probably be OK too.


> If the summaries of the test documents are good, future summaries will probably be OK too

But that is exactly what is problematic with hallucinations. It's a rare / exceptional behaviour that triggers extreme departure from reality. So you can't estimate the extremes by extrapolating from common / moderate observations. You would have to test a lot of previous documents to be confident, and even then there would be a residual risk.


As with anyone, if you are going to take action on something, double check it.


Maybe having it summarize a fiction book (outside of training data)?


is it an accurate summary tho?


In my experience it is very hollow. It skips details unless you force it to.

Gpt4 is still way better


I added it as an option in Discourse, and I've been happy with its output for summarization tasks, suggesting titles and proofreading.


I'm pleased with it. Claude seems kinder and less patronizing than GPT. Not as good at coding yet.


I am frequently interested in problems where answers are easily calculated from public data but the answer is unlikely to be already recorded in a form that search engines can find. Normally I spend a while noodling around looking for data and then use unit-conversion and basic arithmetic to get the final answer.

I tested Claude vs ChatGPT (which I believe is GPT 3.5) and vs Bard for a problem of this sort.

I asked:

1) What current type of power reactor consumes the least natural uranium per megawatt hour of electricity? (The answer is the pressurized heavy water reactor or CANDU type).

2) How much natural uranium does a PHWR consume per megawatt hour of electricity generated? (The answer is about 18 grams.)

3) How many terawatt hours does the United States generate annually from natural gas? (The answer as of 2022 is 1689 TWh, but any correct answer from the past 5 years would have been ok.)

4) How much natural uranium would the United States need to replace the electricity it currently generates from natural gas? (The answer is 1689 * 10^6 * 18 grams, i.e. about 30,400 metric tons of uranium; a quick check of the arithmetic follows below.)
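
A quick check of the arithmetic in step 4, using the figures quoted above:

    # Sanity check of step 4.
    twh_from_gas = 1689                    # US electricity from natural gas, 2022 (TWh)
    mwh_from_gas = twh_from_gas * 1e6      # 1 TWh = 1e6 MWh
    grams_per_mwh = 18                     # natural uranium per MWh for a PHWR
    total_grams = mwh_from_gas * grams_per_mwh
    print(total_grams / 1e6, "metric tons")  # ~30,400 tonnes (1 tonne = 1e6 g)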

In the past Bard, Claude, and ChatGPT all correctly identified the CANDU or PHWR as the most efficient current reactor type.

Claude did the arithmetic correctly at stages 3 and 4, but it believed that a PHWR consumed about 170 grams of uranium per megawatt hour so its answer was off by nearly a factor of 10. ChatGPT got the initial grams-per-MWh value correct but its arithmetic was wild fantasy, so it was off by about a factor of 10000. Bard made multiple mistakes.

------

I just retried with Bard and ChatGPT as of today. On today's retry they fail at the first step.

Bard's response to the initial prompt was "According to the World Nuclear Association, an MSR could use as little as 100 grams of uranium per megawatt hour of electricity. This is about 100 times less than the amount of uranium used by a traditional pressurized water reactor."

Since there are no MSRs currently generating electricity, this answered the wrong question. The answer is also quantitatively wrong. Current PWRs consume nowhere near 10,000 grams of uranium per megawatt hour.

ChatGPT just said "As of my knowledge cutoff in September 2021, the type of power reactor that consumes the least natural uranium per megawatt hour of electricity is the pressurized water reactor (PWR). PWRs are one of the most common types of nuclear reactors used for commercial electricity generation."

This is wrong also. It correctly identified the CANDU as the most efficient in a previous session, but this was a while ago. I don't know if it was just randomness that caused Bard and ChatGPT to previously deliver correct answers at the first step.


Recent and related:

Claude 2 - https://news.ycombinator.com/item?id=36680755 - July 2023 (255 comments)

Model card and evaluations for Claude models [pdf] - https://news.ycombinator.com/item?id=36681982 - July 2023 (25 comments)


This client wouldn't exist if it were possible to actually get access to the official API.


Have you tried getting on the waitlist? It worked for me, ISTR it took around 2 weeks.


I submitted applications three times to their waitlist over the last several months, and I have never heard back with any response at all. I think my use case is very reasonable (integration with https://cocalc.com, where we use ChatGPT's API heavily right now). My experience is that you fill out a web form to request access to the waitlist, and get no feedback at all ever (I just double checked my spam folders as well). Is that what normally happens for people?


The magic of Hacker News - I just got an invite :-)


If someone from Anthropic is reading this thread: do consider me too ;) I've submitted 3 applications over a period of several months so far. Complete radio silence.

At this point, Anthropic completely ignoring waitlist applications (without even a confirmation) is almost a meme. By the time they open up, a lot of devs will be entrenched in OpenAI's arms.


I have tried a few times with private and company emails from different countries, one of which is the USA, with no luck.


I'm pretty sure it's been over a month now since I submitted my application


I have been on it for months. Perhaps it's a country problem?


The API costs money though


I've been on the API waitlist for months. I'd like to integrate Claude in my open source AI coding assistant tool.

I've had numerous users request this over the past couple of months:

https://github.com/paul-gauthier/aider/issues/7

Feels like it would be unwise to build atop something unofficial like this?


Note that Claude 2 scores 71.2% zero-shot on the python coding benchmark HumanEval, which is better than GPT-4's 67.0%. Is there already real-world experience with its programming performance?


GPT-4's (reproducible) performance out in the wild appears to be much higher than 67. Testing from 3/15 (presumably on the 0314 model) seems to be at 85.36% (https://twitter.com/amanrsanger/status/1635751764577361921). And the linked paper from my post (https://doi.org/10.48550/arXiv.2305.01210) got a pass@1 of 88.4 from GPT-4 recently (May? June?).


I have found just using it in the web interface comparable to OpenAI. But the context window makes a huge difference. I can dump a lot more files in (entire schema, sample records, etc.)


I think it would be nice if companies and projects stopped using famous names to promote their projects.


I wholeheartedly agree! I think Van Damme is macho enough to let it slide for this time though, at least I hope so, otherwise the project is surely in danger.


That's not as clever as you think. :)


Claude Shannon.

It's a beautiful homage.


Yeah I rolled my eyes pretty hard when a crypto company used something like “team grothendieck”.



