

Then you can just download a finetuned version of the same multi-modal foundation model that's trained on documents?


The top 3 models on Hugging Face are all OCR models. Most automation projects involve documents, where you need a model finetuned to understand all the elements inside a document and to provide grounding, confidence scores, etc., which is why this subset of models is gaining popularity.


Would be interesting to see where funding goes to fix these issues. News would heavily impact public opinion, and hence political influence and public funding.


Yes, and it's not just OCR (Optical Character Recognition): it understands layouts and captures signatures, charts, watermarks, etc., so it goes way beyond just characters.


Here is a link to the open-source model: https://huggingface.co/nanonets/Nanonets-OCR-s

And the hosted model: https://docstrange.nanonets.com/
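
For context, a minimal sketch of running the open-source checkpoint locally with the Hugging Face transformers image-text-to-text pipeline. The model card's recommended loading code, prompt, and generation settings may differ, and the image URL below is just a placeholder.

```python
# Minimal sketch (assumptions noted above): OCR a document page with
# nanonets/Nanonets-OCR-s via the transformers image-text-to-text pipeline.
from transformers import pipeline

ocr = pipeline("image-text-to-text", model="nanonets/Nanonets-OCR-s")

messages = [{
    "role": "user",
    "content": [
        # Placeholder image URL; point this at a real scanned page.
        {"type": "image", "url": "https://example.com/invoice_page_1.png"},
        {"type": "text", "text": "Extract this document as markdown, preserving "
                                 "tables, signatures and watermarks."},
    ],
}]

outputs = ocr(text=messages, max_new_tokens=2048, return_full_text=False)
print(outputs[0]["generated_text"])
```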


What service do you use to get notifications about nanonets mentions on HN?



Context engineering is another name people have given to the same skill?


As a reader of Simon's work, I can speculate an answer here.

All "designing agentic loops" is context engineering, but not all context engineering is designing agentic loops. He's specifically talking about instructing the model to run and iterate against an evaluation step. Sure, that instruction will end up in the context, but he's describing creating a context for a specific behavior that allows an agent to be more effective working on its own.

Of course, it'll be interesting to see if future models are taught to create their own agentic loops with evaluation steps/tests, much as models were taught to do their own chain of thought.


I think that's actually quite different.

Context engineering is about making sure you've stuffed the context with all of the necessary information - relevant library documentation and examples and suchlike.

Designing the agentic loop is about picking the right tools to provide to the model. The tool descriptions may go in the context, but you also need to provide the right implementations of them.


The reason I feel they are closely connected is that when designing tools for, let's say, coding agents, you have to be thoughtful about context engineering.

E.g. the Linear MCP server is notorious for returning large JSON payloads that quickly fill up the context and are hard for the model to understand. So tools need to be designed slightly differently for agents, keeping context engineering in mind, compared to how you design them for humans.
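
A hypothetical sketch of the difference: a raw passthrough tool that mirrors the tracker's API verbatim vs. an agent-facing variant that returns only a compact, high-signal view. None of the field names below are Linear's real schema, and api_client is a made-up interface.

```python
# Hypothetical sketch: the same "get issue" capability designed for a human UI
# vs. for a coding agent. Field names and the api_client interface are made up.
import json

def fetch_issue_raw(issue_id: str, api_client) -> str:
    # Passthrough: the full API payload (comments, reactions, avatars,
    # pagination cursors...) - often tens of KB that mostly wastes context.
    return json.dumps(api_client.get_issue(issue_id))

def fetch_issue_for_agent(issue_id: str, api_client) -> str:
    # Context-engineered: only what the agent needs, as dense plain text.
    issue = api_client.get_issue(issue_id)
    lines = [
        f"{issue['identifier']}: {issue['title']} [{issue['state']}]",
        issue.get("description", "").strip()[:2000],   # cap long bodies
    ]
    for comment in issue.get("comments", [])[-5:]:     # only the latest comments
        lines.append(f"- {comment['author']}: {comment['body'][:300]}")
    return "\n".join(lines)
```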

Context engineering feels like the more central, first-principles approach to designing tools and agent loops.


They feel pretty closely connected. For instance: in an agent loop over a series of tool calls, which tool results should stay resident in the context, which should be summarized, which should be committed to a tool-searchable "memory", and which should be discarded? All context engineering questions and all kind of fundamental to the agent loop.
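
One way to picture that coupling is a router inside the loop that sends every tool result to one of those four fates. The routing rules, size threshold, and helpers below are illustrative assumptions, not any particular framework's API.

```python
# Sketch: per-tool-result context policy inside an agent loop.
from typing import Dict, List

def summarize(text: str, max_chars: int = 500) -> str:
    # Stand-in for an LLM summarization call; here it just truncates.
    return text[:max_chars] + ("..." if len(text) > max_chars else "")

class SearchableMemory:
    def __init__(self) -> None:
        self.entries: List[Dict[str, str]] = []

    def store(self, tool: str, text: str) -> None:
        self.entries.append({"tool": tool, "text": text})

    def search(self, query: str) -> List[str]:
        return [e["text"] for e in self.entries if query.lower() in e["text"].lower()]

def handle_tool_result(tool: str, result: str, context: List[str],
                       memory: SearchableMemory) -> None:
    if tool in ("read_file", "run_tests"):
        context.append(result)             # keep verbatim: the model needs the detail
    elif len(result) > 4_000:
        context.append(summarize(result))  # keep a lossy summary resident...
        memory.store(tool, result)         # ...and park the full text behind a search tool
    elif tool == "send_notification":
        pass                               # fire-and-forget: discard entirely
    else:
        context.append(result)
```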


Yeah, "connected" feels right to me.

Those decisions feel to me like problems for the agent harness to solve - Anthropic released a new cookbook about that yesterday: https://github.com/anthropics/claude-cookbooks/blob/main/too...


One thing I'm really fuzzy on is, if you're building a multi-model agent thingy (like, can drive with GPT5 or Sonnet), should you be thinking about context management tools like memory and autoediting as tools the agent provides, or should you be wrapping capabilities the underlying models offer? Memory is really easy to do in the agent code! But presumably Sonnet is better trained to use its own builtins.
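
For the "easy to do in the agent code" option, here is a minimal, model-agnostic sketch: a file-backed memory exposed as two plain tool calls that either GPT-5 or Sonnet could be handed. The tool names and on-disk layout are assumptions; the alternative is leaning on a provider-specific built-in memory feature instead.

```python
# Minimal sketch of a harness-provided memory exposed via tool calls.
import json
import pathlib

MEMORY_DIR = pathlib.Path("agent_memory")
MEMORY_DIR.mkdir(exist_ok=True)

def memory_write(key: str, value: str) -> str:
    """Tool: persist a note under a key so later turns can recall it."""
    (MEMORY_DIR / f"{key}.json").write_text(json.dumps({"value": value}))
    return f"stored {key}"

def memory_read(key: str) -> str:
    """Tool: return a previously stored note, or an empty string."""
    path = MEMORY_DIR / f"{key}.json"
    return json.loads(path.read_text())["value"] if path.exists() else ""

# Both functions would be registered in the tool schema passed to whichever
# model is driving the loop, so the harness stays model-agnostic.
```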


It boils down to information loss in compaction driven by LLMs. Either you carefully design tools that only give compacted, high-information-density output, so models have to auto-compact or reorganize information only once in a while (which is eventually going to be lossy anyway).

Or you just give loads of information without thinking much about it, assume the models will have to do frequent compaction and memory organization, and hope it's not super lossy.


Right, just so I'm clear here: assume you decide your design should be using a memory tool. Should you make your own with a tool call interface or should you rely on a model feature for it, and how much of a difference does it make?


Do you think this'll eventually be trained into the models the way that chain-of-thought has been?


To a certain extent it has already - models are already very good at picking tools to use: ask for a video transformation and it uses ffmpeg, ask it to edit an Excel sheet and it uses Python with openpyxl, etc.

My post is more about how sometimes you still need to make environment design decisions yourself. My favorite example is the Fly.io one, where I created a brand new Fly organization with a $5 spending limit and issued an API token that could create resources in that organization, purely so the coding agent could try experiments to optimize cold start times without messing with my production Fly environment.

An agent might be able to suggest that pattern itself, but it would need a root Fly credential in order to create the organization and restricted credentials for itself, and given how unsafe agents with root credentials are, I'd rather keep that step to myself!


It's amusing to think that the endgame is that the humans in the loop are parents with credit cards.

I suppose you could never be sure that an agent would explicitly follow your instruction "Don't spend more than $5".

But maybe one could build a tool that provides payment credentials, and you get to move further up the chain. E.g., what if an MCP tool could spin up virtual credit cards with spending caps, and then the agent could create accounts and provide payment details that it received from the tool?
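
A purely speculative sketch of that tool, using the MCP Python SDK's FastMCP helper; the card-issuing backend is stubbed out, and all parameter and field names are hypothetical.

```python
# Speculative sketch: an MCP tool that mints spending-capped virtual cards.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("capped-payments")

def issue_card(limit_usd: float, memo: str) -> dict:
    # Placeholder for a real card-issuing API; returns fake data here.
    return {"last4": "4242", "expiry": "12/29"}

@mcp.tool()
def create_virtual_card(purpose: str, spend_cap_usd: float) -> dict:
    """Issue a single-use virtual card with a hard spending cap."""
    if spend_cap_usd > 5.0:
        # Enforce the human-approved limit in code, not in the prompt.
        raise ValueError("cap exceeds the approved budget")
    card = issue_card(limit_usd=spend_cap_usd, memo=purpose)
    return {"last4": card["last4"], "expiry": card["expiry"]}

if __name__ == "__main__":
    mcp.run()
```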


You can always set up an automation for your Google Home to blast music at full volume at the right time. And if you don't wake up from the sound of the music yourself, your neighbour will knock on your door for sure!


Ha! Or the police.


With Google serving AI Overviews, shouldn't an average search query now cost more? Compute is getting cheaper, but algorithms are also getting more and more complex, increasing the compute required?


I read a long time ago that even SFT for conversations, versus the base model for autocomplete, reduces intelligence and increases perplexity.

