
The fundamental frustration most engineers have with AI coding is that they're used to the act of _writing_ code being expensive, with the accumulation of _understanding_ happening for free along the way. AI makes the code free, but the understanding is just as expensive as it always was (although maybe the 'research' technique can help here).

But let's assume you're much better than average at understanding code by reviewing it -- you still have another frustrating experience to get through with AI. Pre-AI, let's say 4 days of the week are spent writing new code, while 1 day is spent fixing unforeseen issues (perhaps incorrect assumptions) that came up after production integration or showing things to real users. Post-AI, someone might be able to write those 4 days' worth of code in 1 day, but making decisions about unexpected issues after integration doesn't get compressed -- that still takes 1 day.

So post-AI, your time shifts almost entirely from the fun, creative act of writing code to the more frustrating experience of figuring out what's wrong with a lot of code that is almost correct. You're way ahead -- you've tested your assumptions much faster -- but that means nearly all of your time will now be spent feeling dumb, trying to figure out why your assumptions are wrong. If your assumptions were right, you'd just move forward without noticing.


Why not just use Claude by itself? Opus and Sonnet are great at producing pixel coordinates and tool usages from screenshots of UIs. Curious as to what your framework gives me over the plain base model.


Hey! To effectively control browser agents, a framework needs systems that interact with the browser and that pass relevant content from the page to the LLM. Our framework manages this agent loop in a way that enables flexible agentic execution that can mix with your own code -- giving you control, but in a convenient way. The Claude and OpenAI computer-use APIs/loops are slower, more expensive, and tailored to a limited set of desktop automation use cases rather than robust browser automation.
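
For concreteness, here's a minimal sketch of what such an agent loop can look like, using Playwright for browser control. The LLM interface (llm.next_action and the action object it returns) is a hypothetical placeholder, not our framework's actual API:

    # Minimal agent-loop sketch (hypothetical LLM interface, real Playwright calls).
    from playwright.sync_api import sync_playwright

    def agent_loop(task, llm, start_url, max_steps=20):
        with sync_playwright() as p:
            page = p.chromium.launch().new_page()
            page.goto(start_url)
            for _ in range(max_steps):
                screenshot = page.screenshot()              # relevant page state for the model
                action = llm.next_action(task, screenshot)  # model picks the next step
                if action.kind == "done":
                    return action.result
                if action.kind == "click":
                    page.mouse.click(action.x, action.y)    # pixel coordinates from the model
                elif action.kind == "type":
                    page.keyboard.type(action.text)
            raise TimeoutError("agent did not finish within max_steps")

The interesting part in practice is everything around this loop: deciding what page state to send, and letting your own code interleave with the model's actions.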


I recently got the Pavlok device to stop a bad habit -- compulsive nail biting. It worked extremely well! It works less well for things like distraction, because you have to manually trigger the negative stimulus.

I find the prospect of an AI continuously monitoring and course-correcting very interesting. Like GPS/maps -- if you go off course, it re-routes you. What will people build with real-time AI like this in the future? I think Cluely may be the best example of this.

I'd personally like a version of the Pavlok device with a built-in camera/mic that allows me to configure these types of triggers with real-time monitoring. It'll probably be a while until we can run an LLM on a watch-sized device in 1-3s and still have ~8hr battery life.


Knowledgework is AI that can actually save you time, make you better organized, and help you do better work. Knowledgework is the answer to the question, “Why isn’t ChatGPT more useful for real-world work tasks?”

While ChatGPT is a better Google search, Knowledgework is like asking a copy of your own brain for help. Today’s chat interfaces are flawed because they require the user to micromanage: not only to know exactly what can and should be done, but to describe in detail all the specifics required to do it. Knowledgework fixes this with two elements that I believe create a new paradigm for AI-enabled software: proactivity and omniscience. Without these two properties, you can’t delegate meaningful real-world tasks to AI.

ChatGPT isn’t useful for this. Not because it’s not intelligent enough (it is), but because it lacks the right knowledge. When I speak to knowledge workers about how they use AI for work, they tell me they’d like to delegate more tasks to AI, but that they experience this frustration of needing to micromanage it, re-teaching everything about their project and their team every time.

How does Knowledgework solve omniscience and proactivity to create an AI assistant that’s useful for delegation of real-world work? It’s a desktop vision AI that watches you work, like an intern who is shadowing you, or a pair programmer. It learns the rich, internal (often unwritten) knowledge specific to your projects that's required to contextualize them. It organizes this into neat, understandable documentation of everything you’re doing: a hyperlinked wiki that connects all of your team’s concepts, decisions, definitions, acronyms, tools, etc.

This explicit representation of your knowledge powers the AI assistant to enable useful delegation — but it’s also really useful in and of itself. Curious how your team came to a decision on something? Click through the wiki to get context.

The other main feature is the Timeline. It’s kind of like a log or an objective summary of how you spent your time. While the wiki mirrors humans’ associative and conceptual memory, the timeline represents episodic memory. This enables you to visually search through your time: imagine you remember solving a similar problem a few weeks ago, but you don’t quite remember when. By going through the Timeline, you can quickly scan to find the specific work session and ask about what you did.

Together, these representations of your knowledge and experience along with the AI assistant running on top begin to feel like a sort of “digital second brain”. Since I started using it, I’ve had the experience where I’m hesitant to even do things on other devices, because it feels like anything I do there is ephemeral.

If you’re excited to upload your mind and see what the future looks like with this tech, sign up for the waitlist here: https://knowledgework.ai.


I know moondream is cheap / fast and can run locally, but is it good enough? In my experience testing things like Computer Use, anything but the large LLMs has been so unreliable as to be unworkable. But maybe you guys are doing something special to make it work well in concert?


So it's key to still have a big model devising the overall strategy for executing the test case. Moondream on its own is pretty limited and can't handle complex queries. The planner gives very specific instructions to Moondream, which is just responsible for locating different targets on the screen. It's basically just the grounding layer between the big LLM doing the actual "thinking" and the specific UI interactions.

Where it gets interesting is that we can save the execution plan the big model comes up with and run with ONLY Moondream if the plan is specific enough, then switch back out to the big model if some action path requires adjustment. This means we can run repeated tests much more efficiently and consistently.
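
A rough sketch of that replay flow, with hypothetical interfaces for the planner and Moondream (not our actual code): the big model produces the plan once, the small model grounds each step on replay, and any step it can't ground escalates back to the planner.

    # Cached-plan replay sketch (all interfaces hypothetical).
    def run_test(task, planner, moondream, page, cached_plan=None):
        # The expensive planner call happens only when no cached plan exists.
        plan = cached_plan if cached_plan is not None else planner.make_plan(task)
        for step in plan.steps:
            # The small model only locates targets on screen; no "thinking".
            target = moondream.locate(page.screenshot(), step.target_description)
            if target is None:
                # UI drifted from the plan: escalate this step to the big model.
                plan = planner.replan(task, plan, failed_step=step)
                return run_test(task, planner, moondream, page, cached_plan=plan)
            page.mouse.click(target.x, target.y)
        return plan  # cache it; the next run can skip the planner entirely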


Ooh, I really like the idea about deciding whether to use the big or small model based on task specificity.



Oh this is interesting. In our case we are being very specific about which types of prompts go where, so the planner essentially creates prompts that will be executed by Moondream, instead of trying to route prompts generally to the appropriate model. The types of requests that our planner agent vs Moondream can handle are fundamentally different for our use case.


Interesting, will check out yours. I'm mostly interested in these dynamic routers so I can mix local and API-based models depending on needs -- I can't run some models locally, but most of the tasks don't even require such power (I'm building AI agentic systems).

there's also https://github.com/lm-sys/RouteLLM

and other similar projects.

I guess your system isn't as oriented toward open-ended tasks, so you can just build workflows that decide which model to use at each step. These routing mechanisms are more useful for open-ended tasks that don't fit into a workflow so well (maybe?).


Why would you ever hire a human to perform some task for you in a company? They're known for having problems with ambiguity and precision in communication.

Humans require a lot of back-and-forth effort for "alignment", with regular "syncs" and "iterations" and "I'll get that to you by EOD". If you approach the potential of natural interfaces with expectations that frame them the same way as 2000s-era software, you'll fail to be creative about the new ways humans will interact with these systems in the future.


It's a great point that this is how we primarily used to interact with businesses and services, but we've moved on. Many Gen-Z users, for example, will refuse to use a product or service if they have to speak to an actual human. Just like we're now not willing to take a boat across the ocean for 3 months -- before airplanes, this was not uncommon.


Taking a 3-month voyage was still an uncommon thing for a person to do; it’s just that it was the most common type of intercontinental journey due to lack of competition.


I think we can have the best of both worlds here. We want the precision and speed of using vi commands, but we want the discoverability of GUI document editors. LLMs may be able to solve the discoverability problem. If the editor can be highly confident that you want to use a given command, for example, it can offer it as an intellisense-like completion. I don't think we've cracked the code on how this UX should work yet, though -- as evidenced by how many people find cursor/copilot autocompletion suggestions so frustrating.
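
A toy sketch of that confidence gate (the model interface is hypothetical): only surface the predicted command when it clears a high threshold, so suggestions aid discoverability instead of becoming the noise people complain about in autocomplete.

    # Confidence-gated command suggestion (hypothetical model API).
    def maybe_suggest(model, editor_state, threshold=0.9):
        command, confidence = model.predict_command(editor_state)
        if confidence >= threshold:
            return command  # show as an intellisense-like completion
        return None         # below threshold: stay quiet rather than distract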

The other great thing about this mode is that it can double as a teaching method. If I have a complicated interface that is not very discoverable, it may be hard to sell potential users on the time investment required to learn everything. Why would I invest hours into learning non-transferable knowledge when I'm not even sure I want to go with this option versus a competitor? It's a far better experience if I can first vibe-use the product, and if it's right for me, I'll probably be incentivized to learn its inner workings as I try to do more and more.


> We want the precision and speed of using vi commands, but we want the discoverability of GUI document editors.

> The other great thing about this mode is that it can double as a teaching methodology.

gvim has menus and shows the corresponding commands in the menus as shortcuts. That's how I learned that vim has folding, and how to use it.


Tesla/Waymo is a perfect illustration of the point, but the Bitter Lesson doesn’t allow us to pick a winner here. The Bitter Lesson tells us that the Tesla approach (fully end-to-end, minimizing hand-coded features/logic) will _ultimately_ win out. It does not tell us that this approach has to economically justify itself 1 year or 5 years in, or that a company betting on it while the technology is immature can avoid bankrupting itself while waiting for data and compute to scale.

In other words, even if the Tesla end-to-end, compute-driven approach ultimately (possibly in 20+ years) proves simpler and more effective, Tesla might not survive to see it happen. Manual feature engineering and hacking can always give temporary gains over data- and compute-driven approaches; the Bitter Lesson was clear about this. I suspect Waymo will win, and at some point, once they are out of their growth-at-all-costs stage, they will transition into their maximum-value-extraction stage, in which vision will make significantly more economic sense than LiDAR. Once they’ve won, they’ll have plenty of time to see the Bitter Lesson through to its ultimate consequences. Elon is right, but he’s probably too early.


That's religion, not a predictive theory.

The Bitter Lesson has held up in a lot of domains where injecting human inductive bias was detrimental. Adding LiDAR, for example, is not inductive bias -- it's a strictly superior form of sensing. You wouldn't call a wolf's sense of smell "hand-engineered features", or a cat's reflexes a failure of evolution to extract more signal from an inferior sensory input.

Waymo will win because they want to make a product that works and not be ideological about it - that's ultimately what matters.


“The vast majority of our work is already automated to the point where most non-manual workers are paid for the formulation of problems, social alignment in their solutions, ownership of decision making / risk, action under risk, and so on”

Exactly! What a perfect formulation of the problem.

