srush's comments | Hacker News

A lot of people use it! It scores very well on our benchmarks, significantly better than Composer-1.


There are lots of good models we like here. But we agree that getting to the right point on the smart+fast graph can make agentic coding feel really good.

(Cursor researcher)


We will do our best. Luckily I don't think there are major telecom companies called Composer-2.


There is also a very popular package manager called Composer. Do companies not search for name collisions? Or do they squat on community projects on purpose?


Unfortunately not, as we used our own internal code for the benchmark. We would also like to see more benchmarks that reflect day-to-day agentic coding use.


Is there any information at all available, anywhere, on what Cursor Bench is testing and how?

It's the most prominent part of the release post - but it's really hard to understand what exactly it's saying.


Roughly, we had Cursor software engineers record real questions they were asking models, and then record the PR they made that contained the result. We then cleaned these up. That is the benchmark.


Are you able to give a sense of how many questions, which domains they were split over, and how that split looked in % terms?

As a user, I want to know - when an improvement is claimed - whether it’s relevant to the work I do or not. And whether that claim was tested in a reasonable way.

These products aren’t just expensive - they require switching your whole workflow, which is becoming an increasingly big ask in this space.

It’s pretty important for me to be able to understand, and subsequently believe, a benchmark - I find it really hard not to read it as ad copy when this information isn’t present.


Which programming languages/tools/libraries did the team's questions/code involve?


We like the name Composer and were sad to see it go. Excited to bring it back. (Agree Cheetah is a cool name too.)


There is a footnote that should help with the models. Training is a harder thing to report on, but roughly our finding here is that RL scales.



I would have thought it's because you use Cursor...


Agree that Sonnet 4.5 is an excellent model. Would be curious to hear your experience using Composer though, it's quite good.


I'll try it out! I haven't yet - just generally conveying my opinion that I personally weigh "better model" as much more important than speed, assuming some level of "fast enough".

Also, didn't realize you worked at Cursor - I'm a fan of your work - they're lucky to have you!


Thanks! Yeah, been working here for 9 months now. Fascinated by agentic coding both as a researcher and a user.

Totally agree that a "smart model" is table stakes for usefulness these days.


> Composer though, it's quite good

Wow, no kidding. It is quite good!


We also are big Tab users here at Cursor. In the blog we talk about how the motivation for this project came from thinking about a Tab-like agent.


Hi everyone,

I am an ML researcher at Cursor and worked on this project. Would love to hear any feedback you may have on the model, and I can answer questions about the blog post.


Impressive systems write-up. A question: if Composer is an RL finetune on an open model, why keep the weights closed? The edge from a slightly better checkpoint erodes quickly in this market; it's not a durable advantage. Composer protects Cursor's margins from being squeezed by the big AI labs, but that is true whether the weights are open or closed, and I think Cursor would get more lasting benefit from generating developer goodwill than from a narrow, short-lived advantage. But that's just my opinion. I personally find it hard to get excited about yet another proprietary model. GPT-5 and Sonnet 4.5 are around when I need one of those, but I think the future is open.


It's stunning.

I don't use these tools that much (I tried Cursor a while ago and rejected it), but having played with GPT-5 Codex (as a paying customer) yesterday in regular VSCode, and having had Composer-1 do the exact same things just now, it's night and day.

Composer did everything better, didn't stumble where Codex failed, and most importantly, the speed makes a huge difference. It's extremely comfortable to use, congrats.

Edit: I will therefore reconsider my previous rejection


Awesome to hear, I will share with the team.


Why did you stop training shy of the frontier models? From the log plot it seems like you would only need ~50% more compute to reach frontier capability


We did a lot of internal testing and thought this model was already quite useful for release.


Makes sense! I like that you guys are more open about it. The other labs just drop stuff from the ivory tower. I think your style matches better with engineers who are used to datasheets etc. and usually don't like poking a black box


Thanks! I do like the labs' blog posts as well though; OpenAI and Anthropic have some classics.


Which model did you distill it from? Great work! PS: I'm getting a few scenarios where it doesn't follow rules as well as Sonnet 4.5.


The blog talks about the training process. Specifically we trained with RL post-training on coding examples.


Makes sense, but what model was used for the base? Is it some open-source model, and you're not at liberty to disclose?


Not a Cursor employee, but still a researcher: it's Zhipu/Z.ai GLM-4.6/4.5. There are traces of Chinese in the reasoning output, plus it's the only model it would make sense to do this kind of RL on, and it already delivers near-SOTA performance and is open-source/open-weight.

Cursor Composer and Windsurf SWE 1.5 are both finetuned versions of GLM.


interesting, thank you


that's cool thanks!


Do you have any graphs handy that roughly replicate the first one in the blog post but are a bit less ambiguous, maybe without model grouping? I feel like it would have been fairer to include proper names and show the models individually rather than group everything together, and then present your own model on its own.


Is the new model trained from scratch? What training data went into it?


Is it true that Cheetah is Grok Code Fast 2? Does this mean that the new Cursor model is also based on Grok?


Cheetah was an earlier (and dumber) version of this model that we used to test production speed. They are both developed in-house. If you liked Cheetah, give this model a try.


This is nice. I liked Cheetah for grunt work that I want to get out quickly and that is not too hard. The speed is really awesome. A model that would run at even higher speeds, like the OSS models at Groq/Cerebras, would really be workflow-changing, because the slowness of SOTA models really breaks the flow. I find myself taking a ton of breaks and getting distracted while I wait for a model to complete a task (e.g. just now).


Let us know how you like it.


Awesome, thanks for the clarification. So are the rumors around Cheetah being based on a Grok model just straight up untrue? I want to try Composer but have a pretty strict no X/Grok policy.


Straight up untrue.


There is a YouTube livestreamer building with it now, if you are looking for direct feedback: https://www.youtube.com/watch?v=1bDPMVq69ac


neat!


Congratulations on your work. I spent the day working with a mix of the Composer/Sonnet 4.5/Gemini 2.5 Pro models. In terms of quality, Composer seems to perform well compared to the others. I have no complaints so far. I'm still using Claude for planning/starting a task, but Composer performed very well in execution. What I've really enjoyed is the speed. I had already tested other fast models, but with poor quality. Composer is the first one that combines speed and quality, and the experience has been very enjoyable.


I prefer the approach of focusing on faster models despite their lower intelligence, because I want my IDE to fly when I can see the code. I find this useful when I need to manually debug something that no model is able to do, so I know the model is going to fail, but at least it will fail fast. On the other hand, if I need more intelligence I have my other CLI that doesn't let me see the code but gets the planning and difficult code done.


Our view is that there is now a minimal amount of intelligence that is necessary to be productive, and that if you can pair that with speed, that is awesome.


What's funny is there are many industries outside A.I. that pick their talent the same way. ;)


is Composer a fine tune of an existing open source base model?


Our primary focus is on RL post-training. We think that is the best way to get the model to be a strong interactive agent.


So, yes, but you won’t say what the base model is? :)


It seems like a sort of Sonnet model, as a lot of people are reporting on Twitter that it likes to spam documentation, like Sonnet 4.5 does.


Can you please tell us more about how you used Ray for setting up the RL infrastructure?


Oh, good question. I'm actually speaking at the Ray Summit next week in SF, so we will talk more about it there. We used Ray throughout the pipeline: for running evals, for the RL controller, for data collation, and for visualizations. One tool we found helpful was Ray Data, which let us easily scale over data and run logs.


Please share more about the Ray Data use case.


We use Ray Data for our map-style processing jobs. For example, one tool we have runs over all the rollouts from the RL system and collects qualitative statistics to understand which types of agent trajectories are being rewarded, and what types of searches and terminal commands are being made.
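
To make that concrete, here is a rough, hypothetical sketch of what such a Ray Data map-style job could look like (the bucket path, rollout schema, and field names below are made up for illustration, not Cursor's actual pipeline):

    import ray

    # Each record is assumed to be one serialized agent rollout (JSON lines).
    # The path and field names ("steps", "tool", "reward") are hypothetical.
    rollouts = ray.data.read_json("s3://example-bucket/rl-rollouts/")

    def summarize(rollout: dict) -> dict:
        """Collect simple qualitative statistics for a single rollout."""
        steps = rollout.get("steps", [])
        return {
            "rewarded": rollout.get("reward", 0.0) > 0.0,
            "num_searches": sum(1 for s in steps if s.get("tool") == "search"),
            "num_terminal_cmds": sum(1 for s in steps if s.get("tool") == "terminal"),
        }

    # Ray Data parallelizes the map across the whole dataset of rollouts.
    stats = rollouts.map(summarize)

    print(stats.mean(["num_searches", "num_terminal_cmds"]))
    print(stats.filter(lambda r: r["rewarded"]).count(), "rewarded rollouts")

The same read/map/aggregate pattern extends to things like tallying which terminal commands show up most often in rewarded trajectories.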


Amazing work! The UX is great.

GPT-5-Codex does more research before tackling a task; that is the biggest weakness keeping me from using Composer yet.

Could you provide any color on whether ACP (from Zed) will be supported?


How many times have you needed to reset the optimizer during the RL training cycles?


How do you work with multiple agents?


We train with a single agent. Is that the question?

