Hacker News | wibyweb's comments

Not sure how YaCy works exactly, except that it's some sort of grand decentralized search engine project in which all the peers are somehow connected together.


Thank you, I do hope you get around to building a wiby engine!


You can certainly change the crawler's database connection from "localhost" to an IP address on a different machine, but I am unsure how that works with that type of proxy (I had to look up what SOCKS5 is). Sounds like it can work, though.


Well, what I meant was: say I have a server with the IP 13.223.12.212 and I want to run the crawler there, but I would like to crawl the actual websites from the IP 23.215.23.15 (i.e. my proxy, SOCKS5 being one of several protocols for doing this).

If you get what I mean :P

I assume it's possible if I just change some of the curl options in the crawler code.


Do let me know if you succeed


Thank you! Makes me happy to see people enjoying it.


A group of people would have to band together to share the same table (windex) with each other. Personally I am interested in seeing people try to cultivate their own niche indexes instead of working towards a common one.


Thank you. My index is puny compared to yours because I only index the pages submitted by guests and don't crawl any further; the index is in the tens of thousands. The hyperlink crawling feature was added and tested specifically for this release, as I understand that some people will want to use it heavily to build up a much larger index. My computers were super cheap and handle my puny index well, because it's puny. I have no idea how well my approach compares to others in terms of performance.


With this code you will start out with a blank index and will have to start making submissions to your search engine to build the index, but you can search the results as soon as the pages get crawled. The video demo provides a practical example.


>how big did the fulltext table became for x entries on wiby.me

I want Wiby to consist mainly of human-submitted pages, so for 99% of the index, only the pages submitted by users are indexed and no further crawling is done. However, I recognized that without the capability to crawl through links the engine would not be useful to others, so I added that capability to my liking and tested it accordingly. I imagine others might want to depend heavily on hyperlink crawling for their use case, but there is a tradeoff in the quality of the pages that get indexed and the resources they require.

>and what is a common response time on N amount of searches per minute for this dataset?

Hard to say exactly, as I haven't run many benchmarks, but my goal is to keep multi-word queries to within about a second. Single-word queries are very fast. My 4 computers handle hundreds of thousands of queries per day because Wiby is being barraged by a nasty spam botnet with thousands of constantly changing IPs. If I don't keep them in check, they will eventually eat up all the available CPU.

>Would you offer a /traffic or /stats page within about/ ? duckduckgo shows traffic, not index stats though.

Probably not on mine, since I don't get enough traffic for it to be of much interest to me. I privately use goaccess to get a general idea of daily traffic.


I like this approach as a possible basis for a personal search engine that only contains stuff I have been looking at. For that it would be helpful to have some kind of browser extension that can autosubmit everything in my history. Ideally that extension would also autoaccept every submission, so it can work fully in the background without my intervention.

Also helpful would be a whitelist/blacklist feature: say, wikipedia and stackoverflow may always be autoaccepted while certain other sites are always rejected, and the rest go through the regular review process.

Then I could use it as my default search engine and branch out when I don't find what I am looking for. For that it would also be cool if there were a way to search wiby and another search engine in parallel and display, say, 5 results from each.


Perhaps you can develop such a browser extension. Sounds like a very good idea actually.


>Wiby is being barraged by a nasty spam botnet with thousands of constantly changing IPs.

Short of having a private beta like Kagi, how else could those botnets be excluded? How difficult is it to create a whitelist of uninfected IPs?
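Mechanically, enforcing an allowlist is simple at the web-server layer; a sketch assuming nginx (the CIDR ranges below are placeholder documentation ranges, not real ones):

```nginx
# Allow only vetted ranges to hit the search endpoint;
# everyone else gets a 403.
location /search {
    allow 203.0.113.0/24;
    allow 198.51.100.0/24;
    deny  all;
}
```

The hard part would be curating the list of "uninfected" IPs, not enforcing it, since the botnet's addresses are constantly changing.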


For what it's worth, I ended up putting Marginalia Search behind Cloudflare to deal with what I assume is the same group. At worst I saw 30k queries per hour.


Who the hell has an incentive to forcibly shut down small, independent search engines? Competitors?


My unsubstantiated hunch, based on looking at the types of queries (which at least for me were over-specified as all hell and within the sphere of pharmaceuticals, e-shopping and the like), is that they're gambling on the search engine being backed by Google or Bing, and they're effectively trying to poison the typeahead suggestion data.

I'd guess they're just aiming their gatling gun at whatever sites have an OpenSearch specification, without much oversight.

It's also crossed my mind that it might be some sketchy law firm looking for DMCA violations, since a fair bunch of the queries looked like they were after various forms of contraband. It seems weird they'd use a botnet, though; most of the IPs seemed to be enterprise routers with public-facing admin pages and the like. Does not seem above board at all.


What is the botnet owner thinking of gaining from a small potatoes search engine? Seems rather futile?


I wish I knew. They have nothing to gain; it's effectively a DDoS attack.


From late April up to now, Wiby (a small, mostly unheard-of search engine) has been having the exact same issue: tens of thousands of the exact same type of "powered by..." requests coming from thousands of IPs. They are using a tool called QHub.


Thanks for wiby.me. I have seen QHub coming up in the scraping footprints, but my assumption has been that the footprint query is looking for question-and-answer sites powered by QHub containing their targeted terms, e.g. because there's a known vulnerability in QHub that their scripts can exploit to auto-post backlinks or whatever it is they do. There are lots of other hosting tools besides QHub that come up in the footprints as well. I found some lists of footprints by doing an internet search for one of them: "Designed by Mitre Design and SWOOP".


Interesting, thanks for that extra info.

