Author here. The browser is built from scratch (not based on Chromium/WebKit), in Zig, using V8 as the JS engine.
Our idea is to build a lightweight browser optimized for AI use cases like LLM training and agent workflows, and more generally any type of web automation.
It's a work in progress: there are hundreds of Web APIs, and for now we only support some of them (DOM, XHR, Fetch), so expect most websites to fail or crash. The plan is to increase coverage over time.
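For readers wondering what driving it looks like in practice: since the project targets the same automation workflows as headless Chrome, a standard puppeteer-core script over CDP is one way to picture it. A minimal sketch, assuming a CDP WebSocket endpoint on ws://127.0.0.1:9222 (check the project docs for the actual flags and port):

    // Minimal sketch: driving a CDP-speaking headless browser with puppeteer-core.
    // The ws://127.0.0.1:9222 endpoint is an assumption; check the docs for the real one.
    import puppeteer from "puppeteer-core";

    const browser = await puppeteer.connect({
      browserWSEndpoint: "ws://127.0.0.1:9222",
    });
    const page = await browser.newPage();
    await page.goto("https://example.com");

    // Only DOM/XHR/Fetch are implemented today, so stick to simple extraction.
    console.log(await page.evaluate(() => document.title));

    await browser.disconnect();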
When I've talked to people running this kind of AI scraping/agent workflow, the cost of the AI parts dwarfs that of the web browser parts, which makes the computational cost of the browser largely irrelevant. I'm curious what situation you got yourself into where optimizing the browser results in meaningful savings. I'd also like to be in that place!
I think your RAM usage benchmark is deceptive. I'd expect a minimal browser to have much lower peak memory usage than Chrome on a minimal website, but it should even out or get worse as the websites get richer. The nature of web scraping is that the worst sites take up the vast majority of your CPU cycles. I don't think lowering the RAM usage of the browser process will have much real-world impact.
The cost of the browser part is still a problem. In our previous startup, we were scraping >20 million webpages per day, with thousands of headless Chrome instances running in parallel.
Regarding the RAM usage, it's still ~10x better than Chrome :) It seems to come mostly from V8; I guess we could do better with a lightweight JS engine alternative.
Yes but WebKit is not a browser per se, it's a rendering engine.
It's less resource-intensive than Chrome, but here we are talking orders of magnitude between Lightpanda and Chrome. If you are ~10x faster while using ~10x less RAM, you are using roughly 100x fewer resources.
Careful, as you implement missing features your RAM usage might grow too. This has happened to many projects: lean at the beginning, then just as slow once they have to deal with real-world messiness.
Yeah, it could be nice to allow the user to select the ECMAScript engine that fits their use case / performance requirements (balancing the resources available).
Generally, for consumer use cases, it's best to A) do it locally, preserving some of the original web contract; B) run JS to get the actual content; C) post-process to reduce inference cost; and D) get latency as low as possible (a rough sketch of B and C follows at the end of this comment).
Then, as the article points out, the Big Guns making the LLMs are a big use case for this because they get a 10x speedup and can begin contemplating running JS.
It sounds like the people you've talked to are in a messy middle: no incentive to improve efficiency of loading pages, simply because there's something else in the system that has a fixed cost to it.
I'm not sure why that would rule out improving anything else; it doesn't seem like they should be stuck doing nothing other than flailing around for cheaper LLM inference.
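As promised above, a rough sketch of B and C, assuming the page is already driven via puppeteer-core; the whitespace collapsing and the 8,000-character budget are arbitrary illustrations, not a recommendation:

    // Sketch of B) render with JS and C) post-process before the text hits the model.
    // The 8,000-character budget is an illustrative assumption.
    import type { Page } from "puppeteer-core";

    async function pageToPrompt(page: Page, maxChars = 8000): Promise<string> {
      // Keep visible text only; markup, scripts and styles never reach the model.
      const text = await page.evaluate(() => {
        document.querySelectorAll("script, style, noscript, svg").forEach((el) => el.remove());
        return document.body?.innerText ?? "";
      });
      // Collapse whitespace and truncate to keep the prompt (and the bill) small.
      return text.replace(/\s+/g, " ").trim().slice(0, maxChars);
    }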
> I think your ram usage benchmark is deceptive. I'd expect a minimal browser to have much lower peak memory usage than chrome on a minimal website.
I'm a bit lost: the RAM usage benchmark says it's ~10x less, and you feel it's deceptive because you'd expect RAM usage to be less? Steelmanning: is 10% of Chrome's usage still too high?
The benchmark shows lower RAM usage on a very simple demo website. I expect that if the benchmark ran on a random set of real websites, RAM usage would not be meaningfully lower than Chrome's. Happy to be impressed and wrong if it remains lower.
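For what it's worth, that benchmark isn't hard to run yourself. A sketch, assuming a Linux host (VmHWM from /proc gives peak RSS) and a puppeteer-launched browser; the URLs are illustrative, and note it only samples the main process, so a multi-process browser like Chrome needs its children summed as well:

    // Sketch: peak RSS of the browser's main process after visiting real-world pages.
    // Linux-only (/proc); child processes are not counted here.
    import { readFileSync } from "node:fs";
    import puppeteer from "puppeteer";

    const urls = [
      "https://en.wikipedia.org/wiki/Web_browser",
      "https://news.ycombinator.com",
    ];

    function peakRssMiB(pid: number): number {
      const status = readFileSync(`/proc/${pid}/status`, "utf8");
      const kb = Number(/VmHWM:\s+(\d+) kB/.exec(status)?.[1] ?? 0);
      return kb / 1024;
    }

    const browser = await puppeteer.launch({ headless: true });
    const pid = browser.process()!.pid!;
    const page = await browser.newPage();
    for (const url of urls) {
      await page.goto(url, { waitUntil: "networkidle2" });
      console.log(`${url}: peak RSS so far ~${peakRssMiB(pid).toFixed(1)} MiB`);
    }
    await browser.close();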
Very impressive! At Airtop.ai we looked into lightweight browsers like this one since we run a huge fleet of cloud browsers, but found that anything other than a non-headless Chromium-based browser would trigger bot detection pretty quickly. Even spoofing user agents triggers bot detection, because fingerprinting tools like FingerprintJS will use things like JS features, canvas fingerprinting, WebGL fingerprinting, font enumeration, etc.
Can you share if you've looked into how your browser fares against bot detection tools like these?
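For context on why UA spoofing alone fails, this is roughly the kind of probe such libraries run in page context (a simplified illustration, not FingerprintJS's actual code); a minimal or headless engine tends to answer many of these differently from desktop Chrome:

    // Simplified illustration of fingerprinting signals (not any library's real code).
    function collectFingerprintSignals(): Record<string, unknown> {
      const canvas = document.createElement("canvas");
      const ctx = canvas.getContext("2d");
      let canvasHash = "unavailable";
      if (ctx) {
        ctx.fillText("fingerprint-probe", 2, 12);
        canvasHash = canvas.toDataURL(); // pixel output varies with GPU and font stack
      }
      const gl = document.createElement("canvas").getContext("webgl");
      return {
        userAgent: navigator.userAgent,
        webdriver: navigator.webdriver,        // true under most automation
        languages: navigator.languages,
        hardwareConcurrency: navigator.hardwareConcurrency,
        pluginCount: navigator.plugins.length, // often 0 in headless builds
        webglVendor: gl ? gl.getParameter(gl.VENDOR) : "unavailable",
        canvasHash,
      };
    }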
Please put a priority on making it hard to abuse the web with your tool.
At a _bare_ minimum, that means obeying robots.txt and NOT crawling a site that doesn't want to be crawled. And there should not be an option to override that. It goes without saying that you should not allow users to make hundreds or thousands of "blind" parallel requests, as these tend to DoS sites hosted on modest hardware. You should also be measuring response times and throttling your requests accordingly. If a website issues a response code or other signal that you are hitting it too fast or too often, slow down.
I say this because since around the start of the new year, AI bots have been ravaging what's left of the open web and causing REAL stress and problems for admins of small and mid-sized websites and their human visitors: https://www.heise.de/en/news/AI-bots-paralyze-Linux-news-sit...
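For what it's worth, a politeness layer is not much code. A sketch of what such defaults could look like, using the robots-parser npm package and an arbitrary 2-second per-host delay (the package choice and the numbers are my assumptions, not anything this project ships):

    // Sketch of polite-crawling defaults: respect robots.txt and back off per host.
    import robotsParser from "robots-parser";

    const lastHit = new Map<string, number>();
    const BASE_DELAY_MS = 2000;

    async function politeFetch(url: string, userAgent = "example-crawler"): Promise<Response | null> {
      const { origin, host } = new URL(url);

      // 1. Check robots.txt before touching the page itself.
      const robotsUrl = `${origin}/robots.txt`;
      const robotsTxt = await fetch(robotsUrl).then((r) => (r.ok ? r.text() : ""));
      if (robotsParser(robotsUrl, robotsTxt).isDisallowed(url, userAgent)) return null;

      // 2. Throttle: never hit the same host faster than the base delay allows.
      const wait = (lastHit.get(host) ?? 0) + BASE_DELAY_MS - Date.now();
      if (wait > 0) await new Promise((resolve) => setTimeout(resolve, wait));
      lastHit.set(host, Date.now());

      // 3. Honor explicit back-off signals from the server.
      const res = await fetch(url, { headers: { "User-Agent": userAgent } });
      if (res.status === 429 || res.status === 503) {
        lastHit.set(host, Date.now() + BASE_DELAY_MS * 5); // slow down before any retry
      }
      return res;
    }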
This is HN virtue signaling. Some fringe tool that ~nobody uses is held to a different, weird standard and must be the one to kneecap itself with a pointless gesture and a fake ethical burden.
The comparison to DRM makes sense. Gimping software to disempower the end user based on the desires of content publishers. There's even probably a valid syllogism that could make you bite the bullet on browsers forcing you to render ads.
Not sure what "digital rights" that "manages"? I don't see it as an unreasonable suggestion that the tool shouldn't be set up out of the box to DoS the sites it's scraping; that doesn't prevent anyone technical enough to know what they're doing from forking it and removing whatever limits are there by default. I can't see it as a "my computer should do what I want!" issue: if you don't like how this package works, change it or use another.
Indeed DRM is a very different thing from adhering to standards like `robots.txt` as a default out of the box (there could still be a documented option to ignore it).
He was using DRM as a metaphor for restricted software.
And advocating that software should do whatever the user wants.
If the user is ignorant about the harm the software does, then adding robots.txt support is win-win for all.
But if the user doesn't want it, then it's political, in the same way that DRM is political and anti-user.
This is software telling you what you are allowed to do based on what the software developer wants (assuming the developer cares, of course...). Which is how all software works. I would not want users of my software doing anything malicious with it, so I would not give them the option.
If I create an open-source messaging app, I am also not going to give users the option of clicking a button to spam recipients with dick pics, even if it were dead simple for a determined user to add code for that button themselves.
Who said anything about DRM? It is an open-source tool.
Simply requiring a code change and a rebuild is enough of a barrier to prevent rude behavior from most people. You won't stop competent malicious actors but you can at least encourage good behavior.
If it gets popular, someone will make a fork, but having the original refuse to do things that are deemed abusive sends a message.
It's like the Flipper Zero: the original version does not let you access frequency bands that are illegal in some countries, and anything involving jamming is highly frowned upon. Of course, there are forks that let you do these things, but the simple fact that you need to go out of your way to find them should tell you it is not a good idea.
I feel like you may have a misunderstanding of what DRM is. Talking about DRM outside the context of media distribution doesn't really make any sense.
Yes, someone can fork this and modify it however they want. They can already do the same with curl, Firefox, Chromium, etc. The point is that this project is deliberately advertising itself as an AI-friendly web scraper. If successful, lots of people who don't know any better are going to download it and deploy it without fully understanding (and possibly without caring about) the consequences for the open web. And as I already pointed out, this is not hypothetical; it is already happening. Right now. As we speak.
Do you want Cloudflare everywhere? This is how you get Cloudflare everywhere.
My plea for the dev is that they choose to take the high road and put web-server-friendly SANE DEFAULTS in place to curb the bulk of abusive web-scraping behavior and lessen the number of gray hairs it causes web admins like myself. That is all.
It's exactly DRM, management of legal access to digital content. The "media" part has been optional for decades.
The comment they replied to didn't suggest sane defaults, but DRM. Here's the quote; no defaults work that way (there's no ability to override):
> At a _bare_ minimum, that means obeying robot.txt and NOT crawling a site that doesn't want to be crawled. And there should not be an option to override that.
I'll also add something that I expect to be somewhat controversial, given earlier conversations on HN[0]: I see contexts in which it would be perfectly valid to use this and ignore robots.txt.
If I were directing some LLM agent to specifically access a site on my behalf and get a usable digest of that information to answer questions, or whatever, that use of the headless browser is not a spider; it's a user agent. Just an unusual one.
The amount of traffic generated is consistent with browsing, not scraping. So no, I don't think building in a mandatory robots.txt respecter is a reasonable ask. Someone who wants to deploy it at scale while ignoring robots.txt is just going to disable that, and it causes problems for legitimate use cases where the headless browser is not a robot in any reasonable or normal interpretation of the term.
[0]: I don't entirely understand why this is controversial, but it was.
That would make it impossible to use this as a testing tool. How should automated testing of web applications work if you obey all of these rules? There is also the problem of load testing. This kind of stuff is dual-use by its nature: a load test is also a kind of DDoS attack.
There are so many variables involved that it’s hard to predict what it will mean for the open web to have a faster alternative to headless Chrome. At least it isn’t controlled by Google directly or indirectly (Mozilla’s funding source) or Apple.
In 10 lines of code I could create a proxy tool that removes all your suggested guidelines so the scraper still operates. In other words: not really helping.
Not really, lol. If you add a robots.txt check, someone can just create a fork with a CI action that strips that routine out every time the original repo is pushed... Adding options for filtering and respecting things is great, even as defaults, but trying to force "good behavior" tends to just lead to people setting up a workaround that everyone eventually uses, because why use the hamstrung version instead of the open one where you make your own choices?
Yes! Having done some minor web scraping a long time ago, I did not put any work at all into following robots.txt, simply because it seemed like a hassle and I thought "meh, it's not that much traffic, is it, and the boss wants this done yesterday". But if the tool had defaulted to following robots.txt I certainly wouldn't have minded; it would have given me less noise and made my tool behave better.
Also, throttling requests and following robots.txt actually makes it less likely that your scraper will be blocked, so even for those who don't care about the ethics, it's a good thing to have ethical defaults.
We also considered JavaScriptCore (used by Bun) and QuickJS. We chose V8 because it's state of the art, quite well documented, and easy to embed.
The code is designed to support other JS engines in the future. We do want to add a lightweight alternative like QuickJS or Kiesel (https://kiesel.dev/).
If you support Page.startScreencast, or even just screenshot capture, we could experiment with using this as a backend for BrowserBox once Lightpanda matures. Cool stuff!
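For reference, screenshot capture over CDP is a single call once a session is open; a sketch with puppeteer-core, assuming the connected backend implements Page.captureScreenshot and the same assumed endpoint as above:

    // Sketch: screenshot over raw CDP via a puppeteer-core session.
    import { writeFileSync } from "node:fs";
    import puppeteer from "puppeteer-core";

    const browser = await puppeteer.connect({ browserWSEndpoint: "ws://127.0.0.1:9222" });
    const page = await browser.newPage();
    await page.goto("https://example.com");

    const client = await page.createCDPSession();
    const { data } = await client.send("Page.captureScreenshot", { format: "png" });
    writeFileSync("shot.png", Buffer.from(data, "base64"));

    await browser.disconnect();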
We did not run benchmarks with chrome-headless-shell (aka the old headless mode), but I guess that performance-wise it's on the same scale as the new headless mode.
Happy to answer any questions.