Full Time (marginalia.nu)
932 points by kevincox on June 16, 2023 | 148 comments


That feeling when you walk out of an office for the last time to work on your own thing is exhilarating. I had my moment like that back in 2014 and can still remember it.

Congrats to Viktor and good luck!

Going to go and try your search engine now.

Previous discussion of the search engine a couple of months ago: https://news.ycombinator.com/item?id=35611923 (196 comments)

Many other posts and blog posts over the last couple of years: https://news.ycombinator.com/from?site=marginalia.nu


It was March 2006 for me, and I haven’t worked in an office since. What a great feeling.


I went the other way: in a 15-year career I never worked a day in an office until last week. So far I can still WFH, but I dunno for how long.

There are positives with an office, but I kinda envy how everyone celebrates the opposite move.


What have you been doing since 2006?


One year ago for me. I’ll never work for a boss again. And won’t be the boss of anyone again either - just as important to me.


What do you do then? Live off the interest from some fortune?


I am currently working all the time. Independently designed/developed/operated/owned apps, and worker coop structures for working with others. I like consent-based decision making rather than hierarchical or consensus-based. Command is unnecessary for us to do well for ourselves.


I would love to (though I'm a long way off it, with not much to walk away to) but I wonder what the equivalent feeling is if you already/previously work from home? Shipping the work machine back? Turning it off for the last time? Unplugging web cam and microphone?


Perhaps removing Slack. What a feeling to never have to be online constantly from 8-5 so no one will think I’m not working.


I've been trialling Marginalia Search a little and one thing that's struck me is the latency. The only other site I use with similar latency is HN; Marginalia seems even lower despite being dynamic (HN has a much easier caching story). I wonder is it just down to having lower traffic. It's certainly a lot lower than many low-/zero-traffic blogs I've frequented though.

I've had a look at the README[0] for the Java source code, but it's highly focused on crawls, database & indexing (understandable for search); would be cool to see a front-end focused write-up.

[0] https://github.com/MarginaliaSearch/MarginaliaSearch/blob/ma...


The blog is just hugo so it's 100% static files over nginx.

The search engine is serverside-rendered mustache templates via handlebars[1], served via spark[2]. It's basically all vanilla Java. I do raw SQL queries instead of ORM, which makes it quite a bit snappier than most Java applications. The sheer size of the database also mandates that basically every query is a primary key lookup. The code is written around that constraint.

Although the search engine is a bit on the slow side since it's routed through cloudflare and I think I'm relatively far away from the closest datacenter so it adds like 100ms to the load times.

[1] https://github.com/jknack/handlebars.java

[2] https://sparkjava.com/


> I do raw SQL queries instead of ORM

Love it. I've seen so many cases where engineers with just basic SQL knowledge (like myself; I'm no JOIN god) can run circles around the queries ORMs generate.


ORMs are great to spare you from writing heaps of error-prone boilerplate mapping code.

Which is kinda what they are for, that's why they're called "Object-Relational Mappers". Not "Object-Relational Query Generators". Because they suck at the latter.


ORMs help you the first three weeks. After that it is a lead ball around both feet, dragging you down into dark pits of bad performance, incomprehensible relations, and impossible debugging.

That anyone with more than 6 months of experience still drags Hibernate up from their chest is just absolutely beyond my comprehension.


I think if you're doing basic CRUD operations on small tables with relatively simple relations, then ORM is just fine. This is to be blunt what a lot of applications do, and so a lot of them justifiably use ORM.

That said, the moment you leave the small table simple relations space (by e.g. having a table with a quarter billion rows), then ORM is not a good choice.


Usually they’re not even really relational mappers, but table mappers.


Not sure what you mean: (in an RDBMS) a table is a relation.


Yes exactly. ORMs are usually only good at representing tables, not general relations.


> It's basically all vanilla Java

Have you considered rewriting it in rust? ;)

In all seriousness, it's great to see something written in a "boring" language like Java, which seems to get a lot of hate in developer circles, hover at the top of HN.

Java really can perform amazingly well, especially if you minimise the use of unneeded libraries and frameworks. Super curious to see how your stack evolves as you get more load.

Best of luck to you on the journey!

Ps there's truly a world of difference between "Spring Boot Developer" and "Software Engineer with Java experience". I suspect a lot of people who hate Java or think it performs badly have only worked with the former group of people.


I'm a believer in the conservation of cool. You can build something cool in a boring language, or build something mundane in a cool language.

Java gets a lot of shit, some of it is merited but a lot of it isn't really fair to the language. There's a lot of Java developers that are kinda shit-tier copy-paste developers developing shit-tier copy-paste applications, because the language is so forgiving as to accommodate that, but it's also a competent language that you can do seriously impressive things with.

You can be insanely productive in Java because it's extremely stable and mature. You almost never have to deal with library churn or other upstream changes that urgently need to be fixed. I can think of exactly two instances in my professional life where that's happened: migrating off Java 1.8 and the oh-shit moment of needing to patch Log4Shell.


Over 10 years of professional Java (right from the EJB madness of the early 2000s through the last few years of Spring annotation hell) taught me that the balance to using Java in personal (or startup) projects is simply to stick to core Java features and the bare minimum of libraries, and to be very skeptical of the hot air coming from outside the community.

My code from the late 2000s, with very few modifications to keep up with modern Java syntax, written in bare Java talking to Postgres in just raw SQL, runs circles around anything new I've tried or built with for modern web application backend stacks.

IDE support is top class. Threads are just awesome when done right. Static types make code incredibly easy to read and reason about even years later. JVM has been super stable forever. A whole lot of features I need are just baked right into the language, but not obvious at first. Mvn just works. And my reluctance to external libraries actually made me write the logic myself making me much better understand related concerns.

Congratulations and I wish more cool projects picked Java as their language, but I see them use Oracle's ownership as the strawman argument against it. I don't know enough about that.

Personally, I owe a lot to Java.


Haha, that's fairly similar to my career as well. It really is severely slept on as a language.



Which version of Java are you running?


I'm on 18, but will upgrade to 21 when it drops. Not really felt the need to upgrade to 19 or 20 as they haven't provided any features I'm interested in.


> The blog is just hugo

Yeah the static stuff being fast is less surprising - it was mainly the search results page that astounded me.

> served via spark[2]

Had not heard of this Spark (only the other Apache one). Will definitely take a look.

> Although the search engine is a bit on the slow side since it's routed through cloudflare and I think I'm relatively far away from the closest datacenter so it adds like 100ms to the load times.

I've hit the CF loading screen which introduces a big delay, but when I don't see that the loading is really instantaneous.


I think overall the system is just really well optimized. It needs to be given I'm working with finite hardware.


It's incredibly impressive. Well done.


Watch out for Spark. If not dead, it went into some kind of hiatus. Little activity recently.


Yes, and it was not that well designed to be honest... the successor is quite a lot nicer and it's called Javalin[1].

Same philosophy but just got things right where Spark, being the "first" (in the Java world, using the design inherited by Sinatra[2]) had a few design issues.

[1] https://javalin.io/

[2] https://sinatrarb.com/


Dunno what I'd want to change though. If worse comes to worst, I'll fork it and keep the dependencies up to date.


For anything handling user input I'd be concerned about maintenance status for fixes. Even beyond the codebase itself, even just maintaining an up to date pom.xml can be important - seems theirs was last updated in July of last year. Very brief manual browse of it shows potential exposure to things like https://nvd.nist.gov/vuln/detail/CVE-2022-25647 - not sure if that's reachable in the codebase but there could be others.


That seems to be the status of the whole hadoop ecosystem unfortunately (we're switching away from it at work).


Does the library have anything to do with Hadoop? Are we talking of the same Spark?


https://en.wikipedia.org/wiki/Apache_Spark right? My understanding was it was built on Hadoop (https://en.wikipedia.org/wiki/Apache_Hadoop) infrastructure.


Then it's something else.


Curious: Do you have plans to bring in FOSS LLMs for summarization and Q&A style queries anytime in the coming months?

Btw, I was half expecting that you quit because of FUTO grants (saw your post on their forums), but I guess it wasn't that. Either way, rooting for you!


Not in the short term, if for no other reason than not having the hardware for it. Maybe down the line. Would be neat if someone who was into that sort of stuff wanted to integrate with Marginalia though. I've got a free API ;-)

I do think LLMs have the potential to integrate well with search engines and I'd love to see a sort of open source search ecosystem emerge with different projects collaborating to exceed the sum of their parts.

Yeah I was in talks with FUTO and they agreed to help the project out a bit, I won't say more until I have money in hand. It's taking a while but it's not on their end, I just need to sort out some legal stuff first.


How does HN's latency compare for you if you're logged in vs logged out?


If you are logged in, it has to check whether you upvoted any of the listed items (submissions and comments). If you're logged out, it doesn't need to check anything, so it's faster.


In fact, if I remember right, when logged out it can serve cached, pre-rendered pages. Sometimes when HN is down or underperforming, clearing your cookies or opening in incognito will still allow you to view the site because the cache is still present.
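For what it's worth, a rough sketch of that idea in Java. This is purely illustrative and not how HN is actually implemented; the names (`PageCache`, `serve`, `render`) and the 30-second lifetime are made up for the example. Logged-out requests share one pre-rendered copy per path, while logged-in requests are always rendered because the page depends on that user's votes.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch: serve shared pre-rendered pages to logged-out visitors. */
final class PageCache {
    private record Entry(String html, long renderedAtMillis) {}

    private final Map<String, Entry> cache = new ConcurrentHashMap<>();
    private final long ttlMillis;
    // Counts calls to render(); exposed only so the demo can show cache hits.
    final AtomicInteger renderCount = new AtomicInteger();

    PageCache(long ttlMillis) {
        this.ttlMillis = ttlMillis;
    }

    /** Stand-in for the expensive part: queries, vote lookups, templating. */
    private String render(String path, String user) {
        renderCount.incrementAndGet();
        return "<html>" + path + " for " + (user == null ? "anonymous" : user) + "</html>";
    }

    /** Serves a page; user == null means the visitor is logged out. */
    String serve(String path, String user, long nowMillis) {
        if (user != null) {
            // Personalised page (e.g. which items this user upvoted): always render.
            return render(path, user);
        }
        Entry entry = cache.get(path);
        if (entry == null || nowMillis - entry.renderedAtMillis() > ttlMillis) {
            // Missing or stale: render once and share the result with everyone.
            entry = new Entry(render(path, null), nowMillis);
            cache.put(path, entry);
        }
        return entry.html();
    }

    public static void main(String[] args) {
        PageCache pages = new PageCache(30_000);
        pages.serve("/news", null, 0);        // first anonymous hit: rendered
        pages.serve("/news", null, 1_000);    // second anonymous hit: served from cache
        pages.serve("/news", "alice", 2_000); // logged in: rendered again
        System.out.println("renders: " + pages.renderCount.get());
    }
}
```

In the demo only two of the three requests reach `render`. It would also fit the "still viewable when the site is struggling" observation: as long as the cached copy exists, serving it needs none of the expensive work.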


I'm always logged in so I haven't got a good impression of the difference but it seems lightning fast most of the time when logged in, so it can't be too much of a difference I guess.


Algolia has stunning latency and, I assume, a bucketload of traffic. I suspect they just have very competent infrastructure and fast-as-hell code and queries; perhaps the same is true here.


I appreciate that this is a glass half full or empty kind of situation, but I tend to see "fast" setups actually simply not doing a bunch of the mostly unnecessary stuff that other solutions are doing. This is especially true for code, where of course we're going to get bad results:

We've over-abstracted everything with our own abstractions and/or libraries that are as bad as our own stuff and we're doing this in a language and runtime that is pretty much one big premature pessimisation to begin with because we could do significantly more with less. We have no idea what the GC is doing and when, and we care less about that than "using functional programming" or some other equally pointless-in-itself principle that you would be better taking very little from.


Yeah this is definitely true for Marginalia. It does exactly what it needs to do to serve the page and very little else. There's also no superfluous scripts in the frontend, no session cookie, no user tracking. That stuff does add up.

Many applications do so many redundant calculations it's almost absurd. Professionally I've seen applications do enormous ORM lookups that fetch thousands of objects, stick them into a hash table by some key, and pick one value and compare against some parameter; and then do this operation in four different places in processing a single request. Gee I wonder why we have 800ms page loads...


I totally agree, usually simple is fast. Get the simple stuff right and scale can be simple too.

Of course sometimes a complex problem is difficult to solve simply, and sometimes being verbose is a good tradeoff to help maintainability.


That's surprising to read. HN has always felt fast to me.


I believe they are saying that Marginalia is also very fast.


sibling comment is correct: I was saying that HN is very fast and the Marginalia search engine results page is even faster.


Congratulations on your courageous step. I will really root for it, might even check out if I can contribute. I think Marginalia can have an amazing impact on the web. Right now it is dying from the cancerous growth of SEO spam and information silos ever increasing in size.

I tried Marginalia and already get amazing and fast results. This will make the web fun, creative and interesting again.

Just as, I think, a fellow countryman is proving the world wrong about browsers being impossible to create from scratch with Ladybird, I think you will succeed too. (At least with search engines the competition gets worse every day.)


Best of luck. Easily my favorite project. Emailed Viktor last year about using the marginalia API for my side project[1] and he responded almost immediately. I use the API to get marginalia's arcane search results for a given query and choose a random link from those results to redirect. Endless fun.

Hope to see it continue to grow until the internet goes dark.

[1] https://moonjump.app/


That's fantastic!

A sort of indie internet discovery ecosystem effect is one of the things I've really been hoping to accomplish with Marginalia.


I had seen marginalia mentioned here on HN a couple of times but never got around to using it.

I'm very impressed. Using it I get this old Internet vibe (which someone else also mentioned). Just used it to get some information on a random topic I recently tried to research with Google but failed due to all the SEO crap. It produced several hits of old pages (with the tiny font and the early 2000's graphics and design), but _full_ of information.

Not all the results are good though, it was mostly hit and miss, but the hits were _good_. Will use it from now on.


I'm so happy to hear this!

@marginalia_nu if you're reading this, please know that you're an inspiration and that I crazy appreciate what you're doing.

We need people like you in the world pushing to make interesting things that aren't necessarily profit driven, but instead seek to help add flavor and interest back into the world.

Your search engine is the kind of technology that reminds me of the technology ethos from the 90s and it's so amazing to see you get the chance to actualize it! Don't waste this chance!

No matter what, know that this is the right decision. You have fans, and we're rooting for you!

Thank you for making cool stuff!


Aww shucks, thanks!


I'd love to support, but Patreon link [0] on "supporting" page [1] is 404.

Is another support option in progress to replace?

[0] https://www.marginalia.nu/marginalia-search/supporting/patre...

[1] https://www.marginalia.nu/marginalia-search/supporting/


The links were cropped, so I replaced them with just a word for the service. But it turns out I can't markdown today, and I changed the URLs instead of the text.

Fixed now.


Question for the tester type software engineers on HN...

I like to write browser (puppeteer) tests for user-facing software criteria like "patreon link must work." In the past I've written similar tests for small websites I've created where the purpose is to surface affiliate links for users to click on. My criterion is "from a money standpoint, is this the call to action I want my users to engage in?"

I don't know what type of test this is -- can anyone disambiguate testing terminology for me?

P.S. Browser based testing is brittle but since I often create websites and because I want to really ensure that I'm not 'lying' to myself in tests, using a browser is often the best (albeit slower) choice. These tests usually run in CI and I get notifications if they break.

P.P.S. I wish we had a better mental model for the types of tests than the "testing pyramid." I find the testing pyramid lacking.


> I wish we had a better mental model for the types of tests than the "testing pyramid." I find the testing pyramid lacking.

I have a hunch that every pyramid model is bullshit. It's inherently appealing to present any sequence of things as a pyramid, regardless of whether it makes sense.


Puppeteer, Cypress, (once upon a time Selenium), etc are end to end or e2e


I would probably call these integration tests.

I came across this interesting tool for similar tests the other day. It lets you request websites or APIs and then search the response for a string. It’s more for checking uptime, so I dunno if it would be acceptable for this type of test, but it looks like a cool tool.

https://onlineornot.com


Just calling it "automatic test" will do. Or end to end where you have an automatic test acting as a user.


Seems like the link text is correct (https://www.patreon.com/marginalia_nu) but the linking functionality is missing.


Great! Wishing you good luck.

What I find quite nice about Marginalia is for discoveries outside the most popular destinations for such topics. For example, looking for a weekend movie but do not want to see all the SEO websites talking about movies. Marginalia surprises you with some unknown websites in the first page :) I use it when I want to be surprised by the results :D


Just gave it a shot and this seems really interesting!


I left a very good job 7 years in (digital design) to go out on my own. That was more than 2 decades ago. I could write paragraphs of the rookie mistakes (business-wise) and the financial ups & downs, but one thing has never changed...

The "temporal freedom" I have in my work (Gad Saad, if you don't know the name). I love being the master of my own day, of my own time. I don't sit in Zoom meetings or have daily standups. I can get up at 5am and work until 11am, and then go hike, play with my dog, get ice cream with my daughters, workout, etc. and then work again from 7pm until midnight or whatever.

Having (almost literally) full control over my daily schedule, week-in, week-out, year after year, is invaluable to me.

One disclosure: a few times a year I do very hard things where I have very little freedom, but they allow me to have lots of freedom the rest of the year.

Not to be a jerk, but I won't be elaborating. And I realize this life isn't for everyone!


Sounds perfect :)

I wonder how many great works we'd have built if most weren't trapped elsewhere.


I'm curious what percentage of living expenses are covered by the project for the author. I have a few products generating over 50% of my yearly expenses and am feeling like going full time is almost a possibility now.

A bit too nervous to pull the trigger just yet.


Author here o/

In general I've had infrequent but large influxes of money from the project, so it's hard to answer. I do have a relatively long runway though, thanks in no small part to NLnet and their generous grant.

On some level it's all a gamble. Either I try to make this work somehow, or I close up shop and keep working as an office drone, because I really can't keep doing both.

My hope is that I'm able to make it work on a Wikipedia-like donation model, maybe supplemented with selling commercial API access (access is free under CC-BY-NC-SA). My burn rate is literally my living expenses plus a hundred dollars per month of service costs, so I don't have to be spectacularly profitable to sustain flight. ... All that is contingent on making it work quite a lot better than it does now, so I guess I have my work cut out for me.

It's also a weird project, since it's had an almost absurdly positive reaction. For example, many people develop a search engine and get almost lynched on HN for not working exactly like Google or not dealing with some query as expected. Someone found a link to my barely working search engine that didn't properly support multiple-keyword queries and this happens: https://news.ycombinator.com/item?id=28550764


> It's also a weird project, since it's had an almost absurdly positive reaction. For example, many people develop a search engine and get almost lynched on HN for not working exactly like Google or not dealing with some query as expected.

I don't know you personally, but you come across as an earnest lone developer doing something for the passion of it. I think that goes a long way on here, versus someone giving off "portfolio project", "hire me" or "seeking investment" vibes. I've not really found a use case for your engine yet but I am really enjoying seeing your progress.


It's not just on HN either. The project was mentioned in The New Yorker and I've done interviews with German radio. Just the weirdest stuff's been happening since basically day one.


Link for anyone looking for the New Yorker article:

https://archive.ph/iIwtV


It has a nostalgic feel about it. Not just the visual design, but how it won't answer questions but it will look for terms. Sometimes you want a less algorithmic engine. Takes me back to my first messing around with dialup in 1994.


In case you want to explore additional ways to extend your runway, there is the STF (Sovereign Tech Fund) https://sovereigntechfund.de/en/challenges/ where they claim to offer €65,000 up to a maximum of €300,000 in funding to FOSS projects.

I have no affiliation but recently came across them from a weekly newsletter (via https://changelog.com/news/48/email).


Thanks, nice lead!



I do find it a bit strange that you "punish modern design" while your own design is very hard to read. I'm not sure whether you came up with that quote or someone on HN did.

It's very hard to read your search results. I've always disliked grid views to represent data. It's very hard to find what you want.

I'm not sure, but it looks like you didn't want to copy Google and wanted to make something "authentic", which is the same reason modern design is often unusable.

Every competitor of Google just gave up trying to find a better, sexier way. DuckDuckGo, Bing, etc. are pure copies. A list view with a good contrasting header is the best way to scan and find the results you want.

If you want to keep it, at least provide a list / grid switcher so users can pick themselves.

Good luck! Happy you get to pursue your passion.


Yeah I'm not a huge fan of how the Magic: The Gathering layout has turned out. Been experimenting with something more list-like, e.g. https://twitter.com/MarginaliaNu/status/1644058334440443916

I don't like the basic old school Google-style list though. It makes very poor use of the screen space. This is primarily a service for desktop users finding desktop content, but I still want something that's accessible to other screen sizes. Really hard to find a good design that works well.


For whatever it's worth, I personally like the screenshots of the pages that show up when you browse random; I think it really helps in recognizing a site you may have been to before. If there were a way to incorporate that into all search results, along with a more information dense listing, I for one would find that quite useful. Kind of a 'I can't remember what it was called, but I'd recognize it if I saw it' sort of thing.

I also really appreciate the desire to use available screen space. It irks me to no end when a site forces a narrow column of info/content and wide empty borders wasting half or more of my screen. Wikipedia recently started doing this and I can't say they're better for it in my opinion.


Just echoing this. I was looking for a site the other day, and I thought I'd use marginalia since it throws up interesting stuff in general, and the site I was looking for had a distinctive look that I knew I would recognise again ... and was disappointed the "magazine stand" view was only for the random sites.

I do like that feature.


As far as layout is concerned, if you don't mind me brainstorming some ideas, I'll share some thoughts.

When a search term yields many results, it's left to the user to search the results for the site that will yield the "best" match for what they're after. It seems like people assume that the better the search engine is, the better it is at predicting what the user is really after by putting it at the top of the listing. But this can be rather difficult when the original search terms are pretty generic and the user is required to scroll and check many results. If there were a way to help the user sort the results based on relevant criteria, maybe that would make that search easier. And personally, I like things that give users a little more say in how they get fed information. Allow sort by popularity, frequency of search terms in page, number of pages in site's domain, date of last page edit (no idea if this is possible to get), etc...

Maybe have multiple columns of search results. One column that lists results that match all words in the query, another for only one or two words. Or maybe columns that list results that include the user's query plus likely related topics. Or a set of search refinement tools that can further help the user sort based on any number of criteria, or filter results by specific related terms.

Slightly related, I really like your encyclopedia site. In addition to being incredibly nice to use all on its own, perhaps it (from the 'See Also', 'Further Reading', 'Related articles', etc... sections) could be mined for suggesting additional search terms/info a user could add to their search or filter their results by. For example if I search for Tcl and get a bunch of results, some tools that suggested filtering (or a search instead option?) the results to those that included Tk, expect, and TclX might help me get to what I'm after quicker.

No idea if any of that is practical or would even actually be that useful in practice.


I like the list view on desktop, I would maybe make the title slightly larger to have a stronger contrast with the description.

They are not my colors, but the contrast is clear!

On mobile I think the cards are too high. A slightly smaller font, and cutting off after 2-3 sentences with a read-more link, would probably make it easier to sift through your results.

But just my random opinion, good luck!


We have more horizontal than vertical space, try to utilize that without stacking search results next to each other?


If nothing else, you could open with just a Patreon or something. Basically as a way of outsourcing the "subscription revenue" implementation until such time as something direct yourself makes sense.


I do have a Patreon, but I guess people aren't finding it and/or have ad-blindness to the words 'donate' and 'patreon' ;P


Please sell something business-like that people can purchase and expense.

A book, software, something. I can't quite expense patreon and others may have a similar issue. (Useful "free SaaS" where all there is is the cup of coffee button makes me sad).


I second this.

My buyer won't even blink if I say that I need a $150 tool: I can just bill it to whatever project it's being used for as long as I get an invoice or a receipt or some kind of documentation. If I say that I found a free tool and I'd like to donate $10 to the author, no one will know how to do that.


The links seem broken. On https://www.marginalia.nu/marginalia-search/supporting/, when I click the Patreon and Buy Me a Coffee links, they go to:

https://www.marginalia.nu/marginalia-search/supporting/patre...

https://www.marginalia.nu/marginalia-search/supporting/buyme...

(the text of the links is correct though)
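Those symptoms are what you'd expect if the href is a bare word rather than a full URL, since a relative reference gets resolved against the address of the page it sits on. A small Java sketch of that resolution, using `java.net.URI`; the href value "patreon" is a made-up stand-in (the real targets are truncated above), and this is not the page's actual markup.

```java
import java.net.URI;

/** Sketch: how a bare word used as an href resolves against the page it is on. */
final class RelativeLinkDemo {
    /** Resolves an href against the URL of the page that contains it. */
    static String resolve(String pageUrl, String href) {
        return URI.create(pageUrl).resolve(href).toString();
    }

    /** True if the href carries its own scheme, so the page's path cannot capture it. */
    static boolean isAbsolute(String href) {
        return URI.create(href).isAbsolute();
    }

    public static void main(String[] args) {
        String page = "https://www.marginalia.nu/marginalia-search/supporting/";
        // Hypothetical bare-word href: ends up under the supporting page's own path.
        System.out.println(resolve(page, "patreon"));
        // A full URL is left untouched by resolution.
        System.out.println(resolve(page, "https://www.patreon.com/marginalia_nu"));
    }
}
```

A check like `isAbsolute` on every outbound href would be a cheap way to catch this class of mistake before a browser test ever runs.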


It's fixed now.


I recommend saying 'Patreon' instead of 'Donate' on the site's main navigation menu! It does have a stronger effect because they'll associate it with a human being behind the screen.


You have my pittance! Your search engine is useful to me for recipes and that crazy cyberpunk network of back-alley Geocities-esque pages it's tapped into


I pulled that trigger later than I could have, I was earning 2x my salary from my side project before I quit.

At 50%, if you can see an upward trend, have ~6 months of savings, and have a plan that the time will let you execute, go for it.


The "main thing" is how hard it would be to get back in the business (i.e, get a job) if the whole thing explodes.

Also if you're going to quit anyway, you might as well ask the company you currently work for if they'll let you go on sabbatical, or part time, or consulting.

That can give you a bit of extra runway/feeling of security.


I suppose you can go back to any of the previous points in your CV as a software professional in today’s day and age if you never burned any bridges. Especially so if you make it obvious in your current job that you’re only leaving because it’s time to try your own thing - if it doesn’t end up working, people are likely to be very understanding.


It's not like I'm doing nothing for these upcoming years.

Dunno what you're doing wrong if you can't land a job with a built-from-scratch internet search engine on your resume.


But can you pass LeetCode :-)


In general all you need is an explanation for a break in work history and there are billions that will satisfy interviewers and HR; at worst just say “health reasons” and then sue when you don’t get the job ;) (/sarcasm)



This is fantastic and commendable. Too many good hackers are tied up in a stable job at companies. Building something out of passion is just so different and the end result is so amazing that one cannot really fake that.

Marginalia is one of my favorite sites. Wishing you all the best.


Cool that you're living your best life. Every time I leave a company, I think about the ending of The Prisoner:

https://youtube.com/clip/UgkxgJAzCqKOL5yMg39wmtZi52tw8LAXOEr...


Great pull. I've always wanted the courage to slam my resignation down on my boss' desk and yell at them, so a different Prisoner scene for me.


Both of your top two projects are very interesting to me at the moment. Especially your Wikipedia mirror.

Just today I realized how distracting too many hyperlinks can be. And Wikipedia is full of them! It feels so much easier to read an article without them. Now I just wish Wikipedia had more supporting graphics to help engage readers in a more productive manner.


This is one of my favorite HN adjacent projects and I use it with some frequency. Glad to see you are committed to it for the long term. Good luck.


It's a great feeling to leave and set off on your own journey.

Beyond the feeling, it's also educational as you learn about your deficiencies quickly (or, in some cases, too slowly).

I'm wrestling with this now as I'm building my platform and looking to pivot into something that produces revenue.


Congrats! I find I'm using Marginalia more and more, it's especially great for researching for novel writing, and can’t wait to see what the future holds! Good luck!


Tiny bit of feedback, your encyclopedia favicon seems to 404:

https://encyclopedia.marginalia.nu/favicon.ico

Otherwise - Great job on the peppy site and breath of fresh air to open the network tab and see 1 html get, and another for the CSS. And the 404 favicon that I guess the browser insists on ;)


I’m curious about what and how crawling is done. I did a search for my own site and didn’t find it (it’s a redirect to another site, which I’m sure doesn’t help). What’s being indexed right now (out of curiosity, not trying to game SEO - that’s why I’m not mentioning the site I searched for here.)


I don't think it's supposed to index all sites. If you search for Twitter, Facebook, Instagram or even Hacker News you will not get any official results. It's meant to only show obscure sites but I'm unsure of the actual criteria.


Congrats to the author! Marginalia is a great service. I hope they find a way to make it viable to keep going, either through donations or some other model.


> I gave it a shot, for no other reason than not being able to quite figure out why this supposedly impossible thing was impossible. Doing the napkin math, it seemed very possible.

I thought this too! So happy someone has tested the assertion.

Good luck! I’ve had Marginalia bookmarked for some time but this story will remind me to try it.


Given that this is written in Java and running on a single server with fixed hardware....

Is there a "peak" amount of optimization feasible in Java for this search engine, and if so what is it, before one would need to turn to C/Rust/etc. to get any more performance out of the given hardware?


Java's main limitation is probably in access to lower level I/O APIs, as well as vectorization support that is somewhat lackluster. There's almost definitely performance left on the table.

It's relying quite heavily on memory mapped I/O and doing some clever things to work around language limitations in how much you can memory map at a time. This permits surprisingly good but not optimal performance.

A bigger drawback is that this type of low level programming in Java is a serious pain in the ass.


Presumably there is a peak, but Java can be really, really fast.

I recently rewrote a heavy algorithm from Java to Rust, thinking that I'd get faster performance pretty much automatically. It turned out to be significantly slower than my optimized Java algorithm, and I didn't have the experience to tune the Rust version, so I ended up sticking to Java for now.

I'm sure someone who knows how could have tuned the Rust version to get better performance, but native code is not my specialty and the Java version was doing fine.

A warmed up JVM is a lot faster than most people think, especially for a long-running app like a search engine.


Working on my own project has easily become one of the most fulfilling things I’ve ever done in my life


You have inspired me for today, I appreciate it.

Congratulations on cutting loose, always a great feeling.


Congratulations! If you or anyone else knows of communities of likeminded people, please share :) I.e. excited about doing their own thing that they’re passionate about, but not all about the VC rat race


Does Marginalia Search have selling points over other search engines? Different features or philosophy?

First I'm hearing of it.


From the about: "This is an independent DIY search engine that focuses on non-commercial content, and attempts to show you sites you perhaps weren't aware of in favor of the sort of sites you probably already knew existed. "

I personally find it hard to put into words, but the old internet and old search engines had this feel to them that you never knew what you were going to get. Each site looked different. Each site had its own philosophy of content and design. Everybody was winging it. It just felt more personal and interesting. At the risk of hyperbole, now it seems search engines give back mostly SEO blogspam that all looks the same.

Marginalia feels more like the former internet.


> now it seems search engines give back mostly SEO blogspam

And with technical questions, too many results are not correct. It seems that Google search is really going downhill in this regard. I'd like a way to vote down results that are obviously SEO trash, but I'm pretty sure if that were provided, it would be gamed too.


It's something like Google if it had kept working like it did in 2002 and then added a bunch of discovery features.


Feedback: I've tried the search engine a couple times and just bounced because there were no results for my queries, and it's hard to think of new queries that actually matter to me when I'm just trying something

Hoping this turns into something magical. I dearly miss the old web!


What queries did you try?


The one I just tried was "urbit key rollover"

Upon closer inspection, I think there's something disconcerting about using this engine in that—I don't know the right words for this but—it filters 100% by all the keywords. So, "rollover" is kind of a strange word, and because you don't have any results with these three words, I see nothing.

I'd prefer to instead see results for "urbit key," given the circumstances. I imagine the algorithms to do this well are complicated though.

Another query that had no results: "multi-band compression maselec"

A query that has surprisingly few results: "qmmf-4." It shows a single forum post from a forum that has probably hundreds of posts matching this query. Why just one?


Wow, thank you for the write-up! For me this is an endorsement of Marginalia, it seems to work exactly like it should... Like Google before it got "smart", or like AltaVista. Love it!


> it filters 100% by all the keywords.

IOW how queries ought to work. I am perfectly capable of changing the precision of my query to affect recall; I prefer that no "algorithm" ruins that ability.
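
A minimal sketch of the strict "every keyword must match" behaviour discussed here, using a toy inverted index. This is not Marginalia's actual implementation; the function names `build_index` and `search_all_terms` and the three sample documents are made up for illustration.

```python
from collections import defaultdict


def build_index(docs):
    """Map each lowercase token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


def search_all_terms(index, query):
    """Return the ids of documents that contain every term in the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    # One posting set per term; a term that was never indexed gives an empty set.
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


# Hypothetical corpus, chosen so that no document holds all three query terms.
docs = {
    1: "urbit key management notes",
    2: "urbit install guide",
    3: "key rollover for dns zones",
}
index = build_index(docs)

print(search_all_terms(index, "urbit key rollover"))  # no document has all three
print(search_all_terms(index, "urbit key"))           # dropping a term widens the match
```

With this kind of matching, removing a keyword can only keep or grow the result set, which is the precision/recall knob described above.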


Interesting. I searched for "what is jit oauth" and got no results.

I did end up getting results with "jit oauth" (quotes not in search), but not great ones.

To be fair, google didn't give me great results for either of those queries either

edit: What I was looking for was related to JWTs, not necessarily oauth, and the actual claim in spec is "JTI" (but I believe a service whose traffic I was inspecting used "jit" instead)


Back when people-facing computer stuff worked more like computer-facing computer stuff, queries were simple: keywords to match. If you get too many results, add more keywords; if you get too few, remove* keywords (or change existing keywords to be more specific/more general respectively).

If you're not searching for literal occurrences of "what" and "is", why should they even be in a query?

* this is an instance of an adjunction, which is an important concept in informatics, but I understand that to actually admit the fact would probably be the kiss of death for anything claiming to be a "user" interface.

Lagniappe: marginalia, upon being queried with "precision" and "recall", came up inter alia with http://comonad.com/reader/2009/remodeling-precision/ , which I count as a win for it.


It's a balancing act. I think Marginalia is a bit too literal with the search terms right now, and Google is way too far off in the other end of the spectrum.

Query understanding is one of the things I'm hoping to address this year. It's very crude right now with some pretty obvious low hanging fruit in the scientific literature that's ripe for implementation.


What's annoying about this design, when there's such a small corpus, is that I need to search for many permutations of a query in order to find good results.

Without a feature like this, I want to have another feature that allows me to search for multiple queries at the same time and interleave the results.

But I also understand the advantages. C'est la vie.
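
For what it's worth, the "run several queries and interleave the results" idea can be sketched client-side in a few lines. This assumes each query has already produced a ranked list of URLs; the function name `interleave` and the example lists are hypothetical, and nothing here is an existing Marginalia feature.

```python
from itertools import zip_longest

# Private sentinel so that shorter lists can be padded without clashing with real items.
_MISSING = object()


def interleave(*result_lists):
    """Round-robin merge of ranked lists, keeping only the first occurrence of each item."""
    seen = set()
    merged = []
    for tier in zip_longest(*result_lists, fillvalue=_MISSING):
        for item in tier:
            if item is not _MISSING and item not in seen:
                seen.add(item)
                merged.append(item)
    return merged


# Hypothetical ranked results for three permutations of the same query.
merged = interleave(
    ["a.example", "b.example", "c.example"],
    ["b.example", "d.example"],
    ["e.example"],
)
print(merged)
```

Round-robin keeps the top hit of every permutation near the top of the merged list, at the cost of ignoring any per-query relevance scores.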


Congratulations! I identify with this post a lot. Good luck! Your actions are certainly an inspiration


Congrats!

Going full time is the only way to go for a project you love and want to grow.

What is the business model?

How many visitors does Marginalia have?


My main question is how you kept focused and motivated to work for two years on this project, especially at the beginning when no one was aware of it. What steps did you take to make sure you kept motivated?


It's just an interesting project. Not really struggling with motivation as a result. It's also huge in terms of scope and ambition, so I can always take a break and take a weekend and go build something else within the scope of the project and still move it forward.


not knowing what comes in the next 12-24 months is exhilarating because you're actually living life with an acceptance of the nature of reality instead of deluding yourself or fighting it.


Is there a reason why stackoverflow isn't in there at all?

EDIT: Also cool project!


I have had it indexed in the past, but I don't at this point. It needs a special treatment since it's not really feasible to crawl, and you have to load it from their xml dumps instead.


I don't know, it could be my perception, but page loads on the search page feel almost instantaneous. Curious to know about the tech stack and especially the underlying infrastructure.


It's vanilla Java on non-virtualized/non-containerized on-prem hardware, a single server (actually just a big PC with an enterprise NVMe SSD). The search index is bespoke, but I'm using MariaDB for the link database.

It's a simple macroservice architecture.


It's actually a bit disconcerting to me because it feels like it didn't do anything. :)


Congratulations, I hope the monetization won't kill it


My hope is that by engineering the service to have an extremely low burn rate (basically my living expenses plus change), I don't need to make that much money off it, so it's viable to keep it close to the way it is.


How do you submit a site? I know of a few good, small blogs and forums that don't come up. They're not on a VPS, either.



Perhaps related (or I'm just not sure how this works), what criteria go into whether a crawled site is indexed? My personal blog has 31 pages "known", 32 crawled, and only 3 indexed.


Whether it appears to be in English, whether there appears to be enough text, things like that.

If you post the domain name (or email me) I can take a closer look tomorrow.


I really appreciate the candor in this post.


congrats! you should put the patreon link in more places IMHO (e.g. the main page and the various page footers)


Congratulations marginalia ! :D



