Got mine after my first acid trip (still don't know if it was real acid). It's not debilitating for me, just annoying. So yeah, be careful out there, folks. The trip was very cerebral, though, and I consider it an important experience in my life, so I'm kind of on the fence about whether it was worth the trade-off....
Personally, what I'm more interested in is the effective context window. I find that when using codex 5.2 high, I preferred to start compaction at around 50% of the context window because I noticed degradation around that point. Though as of about a month ago, that point is now even lower, which is great. Anyway, I doubt I'll be using that full 1 million context in 5.4, but if the effective window is something like 400k, that by itself is already a huge win. That means longer sessions before compaction, and the agent can keep working on complex stuff for longer. But then there's the question of 5.4's intelligence. If it's as good as 5.2 high, I'm a happy camper; I found 5.3 (any variant)... lacking, personally.
Not sure how accurate this is, but I found the contextarena benchmarks today when I had the same question.
It appears only Gemini has actual context == effective context, going by these. Although I wasn't able to test this in either the Gemini CLI or Antigravity with my Pro subscription because, well, it appears nobody at Google actually uses these tools.
I've been building my own voice agent for a while as well and would love to talk to you and swap notes if you have the time. There are many things I'd like to discuss, but right now I'm mainly trying to figure out how a full-duplex pipeline like this could fit into an agentic framework. I've had no issues with the traditional stt > llm > tts pipeline, since it naturally lends itself to agentic behavior like tool use, advanced context management systems, RAG, etc. I separate the human-facing agent from the sub-agent to reduce latency and context bloat, and it works well. While I'm happy with the current pipeline, I always keep an eye out for full-duplex solutions; they look interesting and feel naturally more dynamic because of the architecture. But every time I revisit them, I can't wrap my head around how you would even begin to implement one as part of a voice agent. Sure, some of these things have text input and output channels, but even then, given their own context limitations, it feels like they could never be anything more than a fancy mouthpiece. Maybe I'm just looking at this from a place of ignorance, though. Anyway, would love to talk on Discord with a like-minded fella. Cheers.
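For concreteness, here's a minimal sketch of the stt > llm > tts loop with the human-facing agent split from a tool-using sub-agent, as described above. Every function name here is a hypothetical stub, not any real framework's API; a real system would wire these to actual STT/LLM/TTS services.

```python
# Sketch: traditional stt > llm > tts pipeline with a separate sub-agent.
# All functions are hypothetical stubs standing in for real services.

def stt(audio: bytes) -> str:
    # Stub: a real system would call a streaming STT service here.
    return audio.decode()

def sub_agent(task: str) -> str:
    # Slower agent with its own context (tools, RAG, etc.). Only its
    # final answer flows back, keeping the main history lean.
    return f"result for {task.strip()}"

def human_facing_llm(text: str) -> str:
    # Fast, low-latency model: answers directly or delegates heavy
    # work to the sub-agent to avoid context bloat in this thread.
    if text.startswith("lookup:"):
        return sub_agent(text[len("lookup:"):])
    return f"reply to: {text}"

def tts(text: str) -> bytes:
    # Stub: a real system would synthesize audio here.
    return text.encode()

def voice_turn(audio_in: bytes) -> bytes:
    return tts(human_facing_llm(stt(audio_in)))
```

The point of the split is that tool calls and retrieved documents live only in the sub-agent's context, so the conversational model stays fast.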
For my framework, since I'm using it for outgoing calls, what I'm thinking is I'll add a tool command, call_full_duplex(number, persona_name), that gets personaplex warmed up and connected, pauses the streams, then connects the SIP call, attaches the audio I/O streams to it, and returns to the agent. Then I'd feed the Deepgram and personaplex text in as messages during the conversation and tell the agent to call a hangup() command when personaplex says goodbye or gets off track, and otherwise just wait(). It could also use speak() commands to take over with TTS if necessary, maybe preceded by a shutup() command. This needs a very fast and smart model for the agent monitoring the call.
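The monitoring loop described above could be sketched like this. This is a toy version under my own assumptions: transcript events arrive tagged by source, and the "fast and smart model" is replaced by a trivial rule set just to show the command protocol; nothing here is a real SIP or personaplex API.

```python
# Toy sketch of the call-monitoring protocol: the monitor receives
# transcript events (callee STT and the duplex persona) and emits one
# command per event. Rules stand in for a real fast LLM policy.

COMMANDS = {"wait", "hangup", "speak", "shutup"}

def monitor_policy(event: dict) -> str:
    text = event["text"].lower()
    if event["source"] == "personaplex" and "goodbye" in text:
        return "hangup"   # persona is wrapping up: end the call
    if event["source"] == "deepgram" and "supervisor" in text:
        return "shutup"   # mute the persona before taking over with TTS
    return "wait"         # default: let the duplex model keep talking

def run_call(transcript: list) -> list:
    actions = []
    for event in transcript:
        cmd = monitor_policy(event)
        assert cmd in COMMANDS
        actions.append(cmd)
        if cmd == "hangup":
            break  # hangup ends monitoring immediately
    return actions
```

The interesting design question is latency: the monitor only has to react at utterance granularity, not audio-frame granularity, which is what makes a "fast model in the loop" plausible.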
What's your use case, and which specific LLMs are you using?
I'm using stt > post-trained models > tts for the education tool I'm building, but full STS would be the endgame. Email and Discord username are in my profile if you want to connect!
For me at least, it's an interesting project I can take apart and build on top of. I've built my own agent frameworks 100% from scratch and have learned a lot from them. There's something to be said for learning from others' projects as well; also, because it's an ever-evolving project with so many contributors, whatever fork of it you base your own on, there's a good chance the new goodies will still work with your modified version. For example, I'm looking into LCM right now, and wouldn't you know it, someone ported it to openclaw. But nanobot doesn't have it, so I'm considering working on an LCM port for that. If I succeed, I'll learn a lot and also contribute to progress in my own little way.
How does the whole KV cache situation work for diffusion models? Are there latency and compute/monetary savings from caching? Is the curve similar to autoregressive caching? Or maybe such things don't apply at all, and you can just mess with the system prompt and dynamically change it every turn because there are no savings to be had? Or maybe you can make dynamic changes to the head but still get cache savings because of the diffusion-based architecture?... So many ideas...
As we approach the singularity, things will get noisier and make less and less sense, since rapid change can look like chaos from inside the system. I recommend folks just take a deep breath and look around. Regardless of your stance on whether the singularity is real, or whether AI will revolutionize everything or not, forget all that noise. Just look around you and ask yourself: do things seem more or less chaotic? Are you better or worse at predicting what's going to happen? How far out can your predictions reach now versus, say, 10 or 20 years ago? Conflicting signals are exactly how all of this looks: one account says it's the end of the world, another says nothing ever changes and everything is the same as it always was....
Do you have any resources or YouTube videos that might help someone understand the LCM context management a bit better? I think there's something to this, but I'm having trouble wrapping my head around it. I learn well with analogies, and I'm trying to really grok the concept here. If there are other ways you could explain it, that would be appreciated. Mind you, I've built my own agents from scratch, so I'm not a total novice in these areas; my agents already manage context with sub-agents and multi-layered conversational histories, with RAG thrown in there. But I don't want to make wrong assumptions about your implementation and miss the nuanced, important bits. Regardless, I'll try my best to reread the article and hash it out on my own. Thanks for the paper.
We don't have any other materials yet, but let's see if this lands for you. I can run you through a couple simpler versions of the system, why they don't work, and how that informs our ultimate design.
The most basic part of the system is "two layers". Layer 1 is the "ground truth" of the conversation - the whole text the user sees. Layer 2 is what the model sees, i.e., the active context window.
In a perfect world, those would be the same thing. But, as you know, context lengths aren't long enough for that, so we can't fit everything from Layer 1 into Layer 2.
So instead we keep a "pointer" to the appropriate part of Layer 1 in Layer 2. That pointer takes the form of a summary. But it's not a summary designed to contain all information. It's more like a "label" that makes sure the model knows where to look.
The naive version of the system would allow the main model to expand Layer 2 summaries by importing all of the underlying data from Layer 1. But this doesn't work well, because then you just end up re-filling the Layer 2 context window.
So instead you let the main model clone itself, the clone expands the summary in its context (and can do this for multiple summaries, transforming each into the original uncompressed text), and then the clone returns whatever information the main thread requires.
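To make the two-layer idea concrete, here's a toy sketch under my own naming (Layer 1 as a ground-truth list, Layer 2 as a list of label-style pointers, and the clone as a separate function scope). None of these names come from the actual system; it only illustrates the mechanism of expanding in a clone so the main window never re-fills.

```python
# Toy two-layer context: Layer 1 is the full ground truth; Layer 2
# holds short "pointer" labels. Expansion happens in a clone scope,
# and only the distilled answer returns to the main thread.

GROUND_TRUTH = []   # Layer 1: everything the user has seen
ACTIVE = []         # Layer 2: what the main model sees

def record_turn(text: str) -> None:
    idx = len(GROUND_TRUTH)
    GROUND_TRUTH.append(text)
    # The pointer is a label, not a lossless summary: just enough
    # for the model to know where to look.
    ACTIVE.append(f"[turn {idx}: {text[:12]}...]")

def clone_lookup(turn_indices: list) -> str:
    # The clone expands pointers back into raw Layer 1 text in its
    # own context, then returns only what the main thread needs.
    expanded = [GROUND_TRUTH[i] for i in turn_indices]
    return " | ".join(expanded)
```

The key property is that `ACTIVE` stays small no matter how many lookups happen, because expansion is always done in a throwaway scope.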
Where this system would not fully match the capabilities of RLMs is that, by writing a script that calls itself e.g. thousands of times, an RLM has the ability to make many more recursive tool calls than can fit in a context window. So we fix that using operator-level recursion, i.e., we give the LLM a tool, map, that executes arbitrary recursion, without the LLM having to write a custom script to accomplish that.
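A minimal sketch of what an operator-level `map` could look like, assuming each item is handled in a fresh clone context (the function names here are illustrative, not the system's actual API):

```python
# Sketch of an operator-level `map` tool: the framework fans one
# sub-task out over many items, each in an independent clone context,
# so total work is not bounded by a single context window.

def clone_run(task: str, item: str) -> str:
    # Stand-in for spawning a clone with its own context window.
    return f"{task}({item})"

def map_tool(task: str, items: list) -> list:
    # Results come back per item and can themselves be map()'d again,
    # giving recursion without the model writing a custom script.
    return [clone_run(task, it) for it in items]
```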
I'm in the process of integrating LCM into my own personal assistant agent as its context management system. The main human-facing agent won't be a coding agent, so I'll be modifying the system prompt and some other things quite heavily, but the core concepts of the system will be the backbone. Now that I'm playing around with it, I'm hoping you can answer some questions. I notice that the agent's system prompt mutates, since the local time is injected into the system prompt itself. If that's what's happening, aren't you destroying any hope of prompt caching from the provider? Am I reading this correctly, or was this a deliberate choice for some reason, instead of appending it at the end of the user's turn as system metadata so you preserve the head? Thanks.
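To illustrate the caching concern: provider-side prefix caching generally keys on an exact prefix match, so a timestamp in the system prompt changes the very first bytes of every request and misses the cache, while appending it after the user's turn keeps the prefix stable. The message shapes below are generic, not LCM's actual format.

```python
# Illustration of prefix-cache behavior: a mutating system prompt
# vs. a static one with per-turn metadata appended to the user message.

import time

SYSTEM = "You are a helpful assistant."  # static: cacheable prefix

def build_messages_cache_busting(user_text: str) -> list:
    # Timestamp in the system prompt: the prefix differs every turn.
    return [
        {"role": "system", "content": f"{SYSTEM}\nLocal time: {time.ctime()}"},
        {"role": "user", "content": user_text},
    ]

def build_messages_cache_friendly(user_text: str) -> list:
    # Identical system prompt every turn: the head stays cacheable.
    return [
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": f"{user_text}\n[meta: local time {time.ctime()}]"},
    ]
```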
Good article, and I agree with everything in there. For my own voice agent, I decided to make him push-to-talk by default, as the problems with the model accurately guessing end of utterance are just too great. I think it can be solved in the future, but I haven't seen a really good example of it being done with modern-day tech, including this lab's. Fundamentally, it all comes down to the fact that different humans have different ways of speaking, and the human listening to them updates their own internal model of the speech pattern, adjusting it after a couple of interactions and arriving at the proper way of speaking with that person. Something very similar will need to be done, and at very low latency, for it to succeed in the audio ML world. But I don't think we have anything like that yet. It seems the best you can currently do is tune the model on a generic speech pattern that you expect to fit a large percentage of the human population, and that's about it; anyone who falls outside of that will feel the pain of getting interrupted every time.