Author here. The browser is built from scratch (not based on Chromium/WebKit), in Zig, using V8 as the JS engine.
Our idea is to build a lightweight browser optimized for AI use cases like LLM training and agent workflows, and more generally any type of web automation.
It's a work in progress: there are hundreds of Web APIs, and for now we only support some of them (DOM, XHR, Fetch), so expect most websites to fail or crash. The plan is to increase coverage over time.
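For readers wondering what driving it looks like in practice: since the project targets the same automation workflows as headless Chrome, a standard puppeteer-core script over CDP is one way to picture it. A minimal sketch, assuming a CDP WebSocket endpoint on ws://127.0.0.1:9222 (check the project docs for the actual flags and port):

    // Minimal sketch: driving a CDP-speaking headless browser with puppeteer-core.
    // The ws://127.0.0.1:9222 endpoint is an assumption; check the docs for the real one.
    import puppeteer from "puppeteer-core";

    const browser = await puppeteer.connect({
      browserWSEndpoint: "ws://127.0.0.1:9222",
    });
    const page = await browser.newPage();
    await page.goto("https://example.com");

    // Only DOM/XHR/Fetch are implemented today, so stick to simple extraction.
    console.log(await page.evaluate(() => document.title));

    await browser.disconnect();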
When I've talked to people running this kind of AI scraping/agent workflow, the cost of the AI parts dwarfs that of the web browser parts, which makes the computational cost of the browser largely irrelevant. I'm curious what situation you got yourself into where optimizing the browser results in meaningful savings. I'd also like to be in that place!
I think your RAM usage benchmark is deceptive. I'd expect a minimal browser to have much lower peak memory usage than Chrome on a minimal website, but it should even out or get worse as the websites get richer. The nature of web scraping is that the worst sites take up the vast majority of your CPU cycles. I don't think lowering the RAM usage of the browser process will have much real-world impact.
The cost of the browser part is still a problem. In our previous startup, we were scraping >20 million webpages per day, with thousands of headless Chrome instances running in parallel.
Regarding the RAM usage, it's still ~10x better than Chrome :) It seems to come mostly from V8; I guess we could do better with a lightweight JS engine alternative.
Yes but WebKit is not a browser per se, it's a rendering engine.
It's less resource-intensive than Chrome, but here we are talking orders of magnitude between Lightpanda and Chrome. If you are ~10x faster while using ~10x less RAM, you are using roughly 100x fewer resources.
Careful, as you implement missing features your RAM usage might grow too. This has happened to many projects: lean at the beginning, then just as slow once they have to deal with real-world messiness.
Yeah, it could be nice to allow the user to select the ECMAScript engine that fits their use case / performance requirements (balancing the resources available).
Generally, for consumer use cases, it's best to A) do it locally, preserving some of the original web contract; B) run JS to get the actual content; C) post-process to reduce inference cost; and D) get latency as low as possible (a rough sketch of B and C follows at the end of this comment).
Then, as the article points out, the Big Guns making the LLMs are a big use case for this because they get a 10x speedup and can begin contemplating running JS.
It sounds like the people you've talked to are in a messy middle: no incentive to improve efficiency of loading pages, simply because there's something else in the system that has a fixed cost to it.
I'm not sure why that would rule out improving anything else; it doesn't seem like they should be stuck doing nothing other than flailing around for cheaper LLM inference.
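As promised above, a rough sketch of B and C, assuming the page is already driven via puppeteer-core; the whitespace collapsing and the 8,000-character budget are arbitrary illustrations, not a recommendation:

    // Sketch of B) render with JS and C) post-process before the text hits the model.
    // The 8,000-character budget is an illustrative assumption.
    import type { Page } from "puppeteer-core";

    async function pageToPrompt(page: Page, maxChars = 8000): Promise<string> {
      // Keep visible text only; markup, scripts and styles never reach the model.
      const text = await page.evaluate(() => {
        document.querySelectorAll("script, style, noscript, svg").forEach((el) => el.remove());
        return document.body?.innerText ?? "";
      });
      // Collapse whitespace and truncate to keep the prompt (and the bill) small.
      return text.replace(/\s+/g, " ").trim().slice(0, maxChars);
    }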
> I think your ram usage benchmark is deceptive. I'd expect a minimal browser to have much lower peak memory usage than chrome on a minimal website.
I'm a bit lost: the RAM usage benchmark says it's ~10x less, and you feel it's deceptive because you'd expect RAM usage to be less? Steelmanning: is 10% of Chrome's usage still too high?
The benchmark shows lower RAM usage on a very simple demo website. I expect that if the benchmark ran on a random set of real websites, RAM usage would not be meaningfully lower than Chrome's. Happy to be impressed and wrong if it remains lower.
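For what it's worth, that benchmark isn't hard to run yourself. A sketch, assuming a Linux host (VmHWM from /proc gives peak RSS) and a puppeteer-launched browser; the URLs are illustrative, and note it only samples the main process, so a multi-process browser like Chrome needs its children summed as well:

    // Sketch: peak RSS of the browser's main process after visiting real-world pages.
    // Linux-only (/proc); child processes are not counted here.
    import { readFileSync } from "node:fs";
    import puppeteer from "puppeteer";

    const urls = [
      "https://en.wikipedia.org/wiki/Web_browser",
      "https://news.ycombinator.com",
    ];

    function peakRssMiB(pid: number): number {
      const status = readFileSync(`/proc/${pid}/status`, "utf8");
      const kb = Number(/VmHWM:\s+(\d+) kB/.exec(status)?.[1] ?? 0);
      return kb / 1024;
    }

    const browser = await puppeteer.launch({ headless: true });
    const pid = browser.process()!.pid!;
    const page = await browser.newPage();
    for (const url of urls) {
      await page.goto(url, { waitUntil: "networkidle2" });
      console.log(`${url}: peak RSS so far ~${peakRssMiB(pid).toFixed(1)} MiB`);
    }
    await browser.close();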
Very impressive! At Airtop.ai we looked into lightweight browsers like this one since we run a huge fleet of cloud browsers, but found that anything other than a non-headless Chromium-based browser would trigger bot detection pretty quickly. Even spoofing user agents triggers bot detection, because fingerprinting tools like FingerprintJS will use things like JS features, canvas fingerprinting, WebGL fingerprinting, font enumeration, etc.
Can you share if you've looked into how your browser fares against bot detection tools like these?
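For context on why UA spoofing alone fails, this is roughly the kind of probe such libraries run in page context (a simplified illustration, not FingerprintJS's actual code); a minimal or headless engine tends to answer many of these differently from desktop Chrome:

    // Simplified illustration of fingerprinting signals (not any library's real code).
    function collectFingerprintSignals(): Record<string, unknown> {
      const canvas = document.createElement("canvas");
      const ctx = canvas.getContext("2d");
      let canvasHash = "unavailable";
      if (ctx) {
        ctx.fillText("fingerprint-probe", 2, 12);
        canvasHash = canvas.toDataURL(); // pixel output varies with GPU and font stack
      }
      const gl = document.createElement("canvas").getContext("webgl");
      return {
        userAgent: navigator.userAgent,
        webdriver: navigator.webdriver,        // true under most automation
        languages: navigator.languages,
        hardwareConcurrency: navigator.hardwareConcurrency,
        pluginCount: navigator.plugins.length, // often 0 in headless builds
        webglVendor: gl ? gl.getParameter(gl.VENDOR) : "unavailable",
        canvasHash,
      };
    }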
Please put a priority on making it hard to abuse the web with your tool.
At a _bare_ minimum, that means obeying robots.txt and NOT crawling a site that doesn't want to be crawled. And there should not be an option to override that. It goes without saying that you should not allow users to make hundreds or thousands of "blind" parallel requests, as these tend to DoS sites hosted on modest hardware. You should also be measuring response times and throttling your requests accordingly. If a website issues a response code or other signal that you are hitting it too fast or too often, slow down.
I say this because since around the start of the new year, AI bots have been ravaging what's left of the open web and causing REAL stress and problems for admins of small and mid-sized websites and their human visitors: https://www.heise.de/en/news/AI-bots-paralyze-Linux-news-sit...
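For what it's worth, a politeness layer is not much code. A sketch of what such defaults could look like, using the robots-parser npm package and an arbitrary 2-second per-host delay (the package choice and the numbers are my assumptions, not anything this project ships):

    // Sketch of polite-crawling defaults: respect robots.txt and back off per host.
    import robotsParser from "robots-parser";

    const lastHit = new Map<string, number>();
    const BASE_DELAY_MS = 2000;

    async function politeFetch(url: string, userAgent = "example-crawler"): Promise<Response | null> {
      const { origin, host } = new URL(url);

      // 1. Check robots.txt before touching the page itself.
      const robotsUrl = `${origin}/robots.txt`;
      const robotsTxt = await fetch(robotsUrl).then((r) => (r.ok ? r.text() : ""));
      if (robotsParser(robotsUrl, robotsTxt).isDisallowed(url, userAgent)) return null;

      // 2. Throttle: never hit the same host faster than the base delay allows.
      const wait = (lastHit.get(host) ?? 0) + BASE_DELAY_MS - Date.now();
      if (wait > 0) await new Promise((resolve) => setTimeout(resolve, wait));
      lastHit.set(host, Date.now());

      // 3. Honor explicit back-off signals from the server.
      const res = await fetch(url, { headers: { "User-Agent": userAgent } });
      if (res.status === 429 || res.status === 503) {
        lastHit.set(host, Date.now() + BASE_DELAY_MS * 5); // slow down before any retry
      }
      return res;
    }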
This is HN virtue signaling. Some fringe tool that ~nobody uses is held to a different, weird standard and must be the one to kneecap itself with a pointless gesture and a fake ethical burden.
The comparison to DRM makes sense. Gimping software to disempower the end user based on the desires of content publishers. There's even probably a valid syllogism that could make you bite the bullet on browsers forcing you to render ads.
Not sure what "digital rights" that "manages"? I don't see it as an unreasonable suggestion that the tool shouldn't be set up out of the box to DoS the sites it's scraping; that doesn't prevent anyone technical enough to know what they're doing from forking it and removing whatever limits are there by default. I can't see it as a "my computer should do what I want!" issue: if you don't like how this package works, change it or use another.
Indeed DRM is a very different thing from adhering to standards like `robots.txt` as a default out of the box (there could still be a documented option to ignore it).
He was using DRM as a metaphor for restricted software.
And advocating that software should do whatever the user wants.
If the user is ignorant about the harm the software does, then adding robots.txt support is win-win for all.
But if the user doesn't want it, then it's political, in the same way that DRM is political and anti-user.
This is software telling you what you are allowed to do based on what the software developer wants (assuming the developer cares, of course...). Which is how all software works. I would not want users of my software doing anything malicious with it, so I would not give them the option.
If I create an open-source messaging app, I am also not going to give users the option of clicking a button to spam recipients with dick pics, even if it were dead simple for a determined user to add code for that button themselves.
Who said anything about DRM? It is an open-source tool.
Simply requiring a code change and a rebuild is enough of a barrier to prevent rude behavior from most people. You won't stop competent malicious actors but you can at least encourage good behavior.
If it gets popular, someone will make a fork, but having the original refuse to do things that are deemed abusive sends a message.
It's like the Flipper Zero: the original version does not let you access frequency bands that are illegal in some countries, and anything involving jamming is highly frowned upon. Of course, there are forks that let you do these things, but the simple fact that you need to go out of your way to find them should tell you it is not a good idea.
I feel like you may have a misunderstanding of what DRM is. Talking about DRM outside the context of media distribution doesn't really make any sense.
Yes, someone can fork this and modify it however they want. They can already do the same with curl, Firefox, Chromium, etc. The point is that this project is deliberately advertising itself as an AI-friendly web scraper. If successful, lots of people who don't know any better are going to download it and deploy it without fully understanding (and possibly without caring about) the consequences for the open web. And as I already pointed out, this is not hypothetical; it is already happening. Right now. As we speak.
Do you want Cloudflare everywhere? This is how you get Cloudflare everywhere.
My plea for the dev is that they choose to take the high road and put web-server-friendly SANE DEFAULTS in place to curb the bulk of abusive web-scraping behavior and lessen the number of gray hairs it causes web admins like myself. That is all.
It's exactly DRM, management of legal access to digital content. The "media" part has been optional for decades.
The comment they replied to didn't suggest sane defaults, but DRM. Here's the quote; no defaults work that way (there's no ability to override):
> At a _bare_ minimum, that means obeying robot.txt and NOT crawling a site that doesn't want to be crawled. And there should not be an option to override that.
I'll also add something that I expect to be somewhat controversial, given earlier conversations on HN[0]: I see contexts in which it would be perfectly valid to use this and ignore robots.txt.
If I were directing some LLM agent to specifically access a site on my behalf and get a usable digest of that information to answer questions, or whatever, that use of the headless browser is not a spider; it's a user agent. Just an unusual one.
The amount of traffic generated is consistent with browsing, not scraping. So no, I don't think building in a mandatory robots.txt respecter is a reasonable ask. Someone who wants to deploy it at scale while ignoring robots.txt is just going to disable that, and it causes problems for legitimate use cases where the headless browser is not a robot in any reasonable or normal interpretation of the term.
[0]: I don't entirely understand why this is controversial, but it was.
That would make it impossible to use this as a testing tool. How should automated testing of web applications work if you obey all of these rules? There is also the problem of load testing. This kind of stuff is dual-use by its nature: a load test is also a kind of DDoS attack.
There are so many variables involved that it’s hard to predict what it will mean for the open web to have a faster alternative to headless Chrome. At least it isn’t controlled by Google directly or indirectly (Mozilla’s funding source) or Apple.
In 10 lines of code I could create a proxy tool that removes all your suggested guidelines so the scraper still operates. In other words: not really helping.
Not really, lol. If you add a robots.txt check, someone can just create a fork with a CI action that strips that routine out every time the original repo is pushed... Adding options for filtering and respecting things is great, even as defaults, but trying to force "good behavior" tends to just lead to people setting up a workaround that everyone eventually uses, because why use the hamstrung version instead of the open one where you make your own choices?
Yes! Having done some minor web scraping a long time ago, I did not put any work at all into following robots.txt, simply because it seemed like a hassle and I thought "meh, it's not that much traffic, is it, and the boss wants this done yesterday". But if the tool had defaulted to following robots.txt I certainly wouldn't have minded; it would have given me less noise and made my tool behave better.
Also, throttling requests and following robots.txt actually makes it less likely that your scraper will be blocked, so even for those who don't care about the ethics, it's a good thing to have ethical defaults.
We also considered JavaScriptCore (used by Bun) and QuickJS. We chose V8 because it's state of the art, quite well documented, and easy to embed.
The code is designed to support other JS engines in the future. We do want to add a lightweight alternative like QuickJS or Kiesel (https://kiesel.dev/).
If you support Page.startScreencast, or even just screenshot capture, we could experiment with using this as a backend for BrowserBox once Lightpanda matures. Cool stuff!
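For reference, screenshot capture over CDP is a single call once a session is open; a sketch with puppeteer-core, assuming the connected backend implements Page.captureScreenshot and the same assumed endpoint as above:

    // Sketch: screenshot over raw CDP via a puppeteer-core session.
    import { writeFileSync } from "node:fs";
    import puppeteer from "puppeteer-core";

    const browser = await puppeteer.connect({ browserWSEndpoint: "ws://127.0.0.1:9222" });
    const page = await browser.newPage();
    await page.goto("https://example.com");

    const client = await page.createCDPSession();
    const { data } = await client.send("Page.captureScreenshot", { format: "png" });
    writeFileSync("shot.png", Buffer.from(data, "base64"));

    await browser.disconnect();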
We did not run benchmarks with chrome-headless-shell (aka the old headless mode), but I guess that performance-wise it's on the same scale as the new headless mode.
Happy to answer any questions.