> I wasn't intending to include transient filesystems in the index. There's abso...

lelanthran · on Dec 20, 2023

> So the question is: how do you know whether /some/random/file has been modified while your daemon or application wasn't running or the filesystem wasn't mounted on your system, without performing a stat call on it? If you don't have an answer to that, which also needs to be orders of magnitude faster, then you'll never match the performance of Everything.

Well, my intention is to match the feature list of Everything, but on Linux, and as far as I knew, Everything did not have full support for external drives - you'd have to convert them to NTFS, or add them to be indexed manually.

The use-case I've seen for Everything has always been for a local user searching their local PC; I wasn't even sure until now that Everything can sometimes search transient filesystems, because no one I ever saw using it used it for files on a transient filesystem.

You're correct; what I cannot do is monitor transient filesystems. But doing permanent filesystems at a speed better than or equal to Everything is still better than anything I've used on Linux, many of which don't even search system files, never mind transient filesystems. And they all use the locate db, which is always a day or so out of date.

And yes, it can be done purely by monitoring filesystem changes. Sure, a full index needs to be built the first time, but that's a one-off cost - index updates after that should be fast enough to do for each write/remove/move operation that you can update the index dozens of times per second.
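To make the "update the index on every write/remove/move" idea concrete, here is a minimal sketch in Python. Everything in it is hypothetical (the `PathIndex` class, its method names, the paths); it only illustrates that applying a single event to an in-memory index is a cheap dictionary operation, and it says nothing about how the events are obtained from the kernel.

```python
import os


class PathIndex:
    """Toy in-memory index (hypothetical): basename -> set of full paths."""

    def __init__(self):
        self.by_name = {}

    def add(self, path):
        # Called for a "file created" event.
        self.by_name.setdefault(os.path.basename(path), set()).add(path)

    def remove(self, path):
        # Called for a "file removed" event; unknown paths are ignored.
        name = os.path.basename(path)
        paths = self.by_name.get(name)
        if paths is None:
            return
        paths.discard(path)
        if not paths:
            del self.by_name[name]

    def move(self, old, new):
        # A move is a remove followed by an add in this toy model.
        self.remove(old)
        self.add(new)

    def search(self, needle):
        """Return all indexed paths whose basename contains `needle`."""
        return sorted(p for name, paths in self.by_name.items()
                      if needle in name for p in paths)


index = PathIndex()
index.add("/home/u/notes.txt")
index.add("/home/u/todo.txt")
index.move("/home/u/todo.txt", "/home/u/archive/todo.txt")
index.remove("/home/u/notes.txt")
print(index.search("todo"))  # ['/home/u/archive/todo.txt']
```

Note that moving a directory is more work than this sketch suggests, since every indexed path below it has to be rewritten as well.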

For non-transient filesystems, performance should be the same as, or better than, Everything.

wander_homer · on Dec 21, 2023

> And yes, it can be done purely by monitoring filesystem changes. Sure, a full index needs to be built the first time, but that's a one-off cost

And how do you build the full index initially without recursively walking the filesystem? Otherwise you're not going to match Everything's performance on initial index creation.

And regarding the second crucial question: How do you know that a file you saw the last time your app or daemon was running, hasn't been modified in the meantime?

You still haven't answered those two fundamental questions. All the other issues are solved anyway.

> index updates after that should be fast enough to do for each write/remove/move operation that you can update the index dozens of times per second

Like I already said, that has never been a problem. My app can currently update the index several thousand times per second and there's still a lot of room for improvement, with plenty of low-hanging fruit.

> For non-transient filesystems, performance should be the same as, or better than, Everything.

You keep saying that, but you're also not giving an answer to how you're going to solve the two major and pretty much only issues.

lelanthran · on Dec 21, 2023

Since it seems we hit the max thread limit (I can't reply to your reply to me), I'll post my reply here, quoting your post as best as I can.

>> I wasn't planning to; it's a once-off cost - the user experience while using any software isn't degraded by the installation time, is it?

> This whole topic started with you claiming that you can even beat Everything in that regard

Nope.

I never claimed that I can beat Everything in "reading the metadata when the app starts". I claimed that I can match the startup and search performance of Everything.

Those are two different claims, and the latter is obviously possible if the application performs queries by querying a daemon that is always running with an in-memory index.

> Remember, your response to:

>> A huge part of Everything's speed comes from reading the master file table that other people mentioned, so you would need a way to quickly read file table entries on linux.

> Was

>> Not a problem. And no, I'm not talking about inotify either, and I'll additionally index the contents of (text) files as well with a negligible additional performance hit. It can be done as fast as, or faster than, `Everything`.

It's "not a problem because the index will already be in RAM and available before the user launches the app". You read that to mean "not a problem because there is a fast way to read file metadata on startup".

I think that there's a difference in that. My proposal is to never have the app need to read anything on startup (other than configuration settings).

> And btw. indexing content will obviously only put you even further behind. The cost is not negligible.

How will it put me further behind? I did say that it will only be done during software installation, right?

>> It's a daemon. If it isn't running while the user is using their desktop system, then it's not working or the user has turned it off.

>> My desktop system currently has 2.5m files. There are maybe a dozen files which will be modified during a maintenance-mode bootup, which has happened exactly zero times in the last decade.

> I have many users, myself included, who use things like shared filesystems, which get modified by multiple systems. And like I already said, modern Linux systems also perform all of their updates in such a maintenance mode. So your app will give thousands of false positives or miss thousands of files completely on those systems.

There are two things there:

1. Shared filesystems - I don't care about this because Everything doesn't care about this being performant: In Everything, as far as I understand it, the user manually indexes network shares.

2. Maintenance modes won't give you thousands of false positives; at most you're looking at a diff of maybe dozens of index entries, if that.

> Sigh... So you're also not going to solve the second issue. I mean I clearly asked you these questions multiple times and I tried to make it clear, that this is where the problem is, to save you and me time, and still you kept it a secret up until now that you're not even attempting to fix those problems.

I didn't keep it a secret - I made it clear that a daemon will hold the index, and the app will talk to it, and that the index will be built once during software installation.

> So I'll have to take back my claim: Under these circumstances I can't guarantee that you'll make a lot of donations, because your app won't do anything special compared to others.

Well, it will be a few orders of magnitude faster to start up than checking for filesystem changes on startup, no?

>> For a Linux desktop file finding utility, monitoring all file writes, moves and deletes pretty much puts you ahead of any game in town right now, right?

> Well, kind of, but it's not particularly difficult to solve that issue. The dev versions of FSearch already can do that.

If you don't mind me asking, how do you do it? Because inotify is out of the question if you want to monitor 2.5m files. Even for just the home directory you will run the risk of exhausting file descriptors by using inotify.

>> Issue 1 - Initial index creation: I will create the index during the s/ware installation process and never create it again unless it is missing. To speed the creation during installation, I will use the mlocate.db file if it is found.

> So you're doing exactly what everyone else is doing.

All the existing utilities create the index only during installation?

> You can also ignore the mlocate index, because it doesn't contain enough information (size, date modified, ... aren't indexed by it so you'd need to stat all of those files anyway).

>> Issue 2 - Files that are changed/moved/removed when daemon is turned off: I don't really care, mostly. Those files a) have such a small probability of both existing and being of interest to the desktop user that lottery jackpots have a higher chance of happening to the user

> Like I already said, you're ignoring the hard and important problem. That's fine, but you suggested otherwise and now you're again doing nothing out of the ordinary.

Which hard and important problem? That changes made in maintenance mode aren't seen?

>> I believe that this is enough to satisfy my original claim[1] of "similar in performance and query capabilities"[2].

> Well, it depends, you're not going to beat Everything in the areas me and others care about and in an attempt to get anywhere near that, you're trading accuracy for speed.

Going from 100.0000000% accurate to 99.9999999% accurate is hardly "sacrificing accuracy for speed", considering that you're still in the statistical rounding error group.

> That's fine, but this is nothing new or special, so I'm not really interested in that.

"Faster than existing Linux tools" would, actually, be something new and novel. "Faster than Everything in some specific areas" almost certainly counts, especially when accuracy is within error bars.

I have one last batch of questions, after which I will simply shut up and get to coding something. I kinda hope that you will answer these questions.

A major feature of Everything when people wax on about its speed is how quickly new entries in the filesystem show up in the application's query results.

Even while the results are open, the user can see files that were added since the last keystroke.

1. How does FSearch handle this common and obvious use-case?

2. What's the newest filesystem change you can expect to see when performing a query in FSearch? Is it "the last change made prior to the application startup"? Is it "The last change made prior to the query"? Is it "The last change made since we walked the filesystem"?

3. What's the p99 for startup time in FSearch? The p99 for query results of N (where N is a suitably large number)?

4. You mentioned "areas that you and others care about". Can you briefly list the areas, other than complete and 100% accuracy during maintenance mode? All I know about is what Everything users appear to care about, and they simply aren't caring about USB memory sticks, cameras plugged in, network drives, maintenance mode diffs, etc. They do appear to care that it is responsive.
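For anyone unsure what "p99" in question 3 means in practice: it is the latency that 99% of measured runs stay at or below. A minimal sketch, assuming the nearest-rank definition of a percentile and a made-up list of latency samples (the `p99` helper and `latencies_ms` are illustrative names, not part of any tool discussed here):

```python
def p99(samples):
    """Nearest-rank 99th percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    # ceil(0.99 * n), computed with integers to avoid float rounding.
    rank = (99 * len(ordered) + 99) // 100
    return ordered[rank - 1]


# Hypothetical measurements: 100 runs taking 1 ms, 2 ms, ..., 100 ms.
latencies_ms = list(range(1, 101))
print(p99(latencies_ms))  # 99
```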

wander_homer · on Dec 21, 2023

> Those are two different claims, and the latter is obviously possible if the application performs queries by querying a daemon that is always running with an in-memory index.

But the daemon also has to start at one point (you're just shifting the problem down the stack) and that's where it gets expensive IF you want to be as accurate as Everything. But of course, if you don't care about accuracy, starting up the daemon isn't time-consuming. I've already discussed this with my users in the past and we settled for a toggle switch where users can opt in to that behavior of more speed at the cost of having false results.

> How will it put me further behind? I did say that it will only be done during software installation, right?

Everything also only does this whenever a filesystem is first detected and scanned; still people care about the performance in those cases. Especially when you're often plugging in USB HDDs and such.

> 1. Shared filesystems - I don't care about this because Everything doesn't care about this being performant: In Everything, as far as I understand it, the user manually indexes network shares.

This is not only about network shares, but also about dual-boot systems, where multiple OSes use the same filesystem, and USB HDDs/SSDs.

> 2. Maintenance modes won't give you thousands of false positives; at most you're looking at a diff of maybe dozens of index entries, if that.

Of course it does. Just in the last week ~13,000 files and folders were modified on my system with the system update (which ran in a maintenance boot environment where other daemons don't get started). That's 13,000 files and folders which will either be missing in your indexing solution or show up as false positives (because you're using outdated metadata, like their old size or timestamps).
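To put the two positions side by side, here is the arithmetic as a small sketch. The inputs are the figures quoted in this thread (about 13,000 entries touched by one update on a system with "more than 3 million files", versus "maybe a dozen" on a system with 2.5m files); treating the first total as exactly 3,000,000 is an assumption, and `stale_percent` is just an illustrative helper.

```python
def stale_percent(stale_entries, total_entries):
    """Share of index entries that are missing or outdated, in percent."""
    return 100.0 * stale_entries / total_entries


# One system update done in a maintenance boot environment.
update_case = stale_percent(13_000, 3_000_000)
# "Maybe a dozen" files changed while the daemon was not running.
dozen_case = stale_percent(12, 2_500_000)

print(f"{update_case:.2f}%")  # 0.43%
print(f"{dozen_case:.5f}%")   # 0.00048%
```

The two estimates differ by roughly three orders of magnitude, which is most of what the disagreement here is about.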

> Well, it will be a few orders of magnitude faster to start up than checking for filesystem changes on startup, no?

Of course, but again that's not the problem. The problem is doing what Everything does: start up a few orders of magnitude faster AND at the same time check for filesystem changes on startup.

> If you don't mind me asking, how do you do it? Because inotify is out of the question if you want to monitor 2.5m files. Even for just the home directory you will run the risk of exhausting file descriptors by using inotify.

I'm using fanotify by default and inotify as a fallback in case the filesystem or kernel doesn't support fanotify with the feature set I need. Running out of file descriptors is usually not an issue, because you don't need to keep file descriptors open for all files. My system has more than 3 million files and even using just inotify for that does work.
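One reason plain inotify scales further than the file count suggests: a watch placed on a directory reports events for the entries inside it, so a recursive watcher needs roughly one watch per directory rather than one per file. The sketch below only counts directories and files under a root to estimate that number; it does not call inotify at all, and `count_dirs_and_files` is a hypothetical helper. On Linux the per-user watch limit can be read from `/proc/sys/fs/inotify/max_user_watches`.

```python
import os
import tempfile


def count_dirs_and_files(root):
    """Return (directories, files) under `root`, counting `root` itself."""
    dirs = files = 0
    for _dirpath, _dirnames, filenames in os.walk(root):
        dirs += 1                 # one watch would be needed per directory
        files += len(filenames)   # files need no watch of their own
    return dirs, files


# Small throwaway tree for the demo: 2 subdirectories with 3 files each.
with tempfile.TemporaryDirectory() as root:
    for sub in ("a", "b"):
        os.mkdir(os.path.join(root, sub))
        for i in range(3):
            open(os.path.join(root, sub, f"f{i}.txt"), "w").close()
    demo_counts = count_dirs_and_files(root)

print(demo_counts)  # (3, 6)
```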

> All the existing utilities create the index only during installation?

Obviously not all, because some don't even create an index to begin with, but many do.

And btw. I doubt that your solution, of creating an index only once, even works, because sooner or later you need to rescan larger parts of the filesystem, when the inconsistencies become too frequent (like when you suddenly receive filesystem change notifications for files which you didn't even know about).

> Which hard and important problem? That changes made in maintenance mode aren't seen?

Getting the index in a consistent state with the filesystem after boot.

> A major feature of Everything when people wax on about its speed is how quickly new entries in the filesystem show up in the application's query results.

> Even while the results are open, the user can see files that were added since the last keystroke.

> 1. How does FSearch handle this common and obvious use-case?

It detects filesystem events with fanotify, queues some of them for batch processing, then applies them to the index and results.
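As an illustration of the "queue, then apply in batches" step, here is a minimal sketch of coalescing a batch of events so each path is touched at most once. This is not FSearch's actual code; the `coalesce` function and the event kinds `"create"`, `"modify"` and `"delete"` are assumptions made up for the example.

```python
def coalesce(events):
    """Collapse a batch of (kind, path) events to one net action per path."""
    net = {}
    for kind, path in events:
        previous = net.get(path)
        if kind == "delete" and previous == "create":
            # Created and deleted within the same batch: nothing to index.
            del net[path]
        elif kind == "modify" and previous == "create":
            # Still a brand-new file from the index's point of view.
            continue
        else:
            net[path] = kind
    return net


batch = [
    ("create", "/a"), ("modify", "/a"),   # new file, written once
    ("create", "/b"), ("delete", "/b"),   # temp file, gone again
    ("modify", "/c"), ("modify", "/c"),   # existing file, written twice
]
net_actions = coalesce(batch)
print(net_actions)  # {'/a': 'create', '/c': 'modify'}
```

Six raw events turn into two index updates here, which is the kind of saving that matters when thousands of files change at once.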

> 2. What's the newest filesystem change you can expect to see when performing a query in FSearch? Is it "the last change made prior to the application startup"? Is it "The last change made prior to the query"? Is it "The last change made since we walked the filesystem"?

In the development version with monitoring support, changes to the filesystem show up in the results almost immediately; it's usually less than a second. Only in the rare case when many thousands of files get modified almost simultaneously can it take a few more seconds. Hence when you sort your results by date modified, you can live monitor all the recent changes that are being made on your system.

> 3. What's the p99 for startup time in FSearch? The p99 for query results of N (where N is a suitably large number)?

This depends on the storage type. But on modern SSDs with a few million files it's usually a second or so to load the index from the database file. You can then search right away and depending on whether you've configured the system to also be accurate or not, a rescan might be triggered in the background, which obviously takes much longer to finish, but then you're guaranteed to have correct results.

> 4. You mentioned "areas that you and others care about". Can you briefly list the areas, other than complete and 100% accuracy during maintenance mode? All I know about is what Everything users appear to care about, and they simply aren't caring about USB memory sticks, cameras plugged in, network drives, maintenance mode diffs, etc. They do appear to care that it is responsive.
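A background rescan of that kind boils down to comparing what the saved index says against what the filesystem says now. The sketch below does this with a full walk plus one `lstat` per file, which is exactly the slow part being argued about in this thread; it is a generic illustration, not FSearch's implementation, and `snapshot` and `diff` are hypothetical names.

```python
import os
import tempfile


def snapshot(root):
    """Walk `root` and record (size, mtime_ns) for every file entry."""
    snap = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)  # lstat: do not follow symlinks
            snap[path] = (st.st_size, st.st_mtime_ns)
    return snap


def diff(old, new):
    """Return (added, removed, changed) paths between two snapshots."""
    added = sorted(new.keys() - old.keys())
    removed = sorted(old.keys() - new.keys())
    changed = sorted(p for p in old.keys() & new.keys() if old[p] != new[p])
    return added, removed, changed


with tempfile.TemporaryDirectory() as root:
    a, b, c = (os.path.join(root, n) for n in ("a.txt", "b.txt", "c.txt"))
    for path in (a, b):
        with open(path, "w") as fh:
            fh.write("old")
    saved = snapshot(root)  # what the index stored before shutdown

    # Changes made while the (hypothetical) daemon was not running.
    with open(a, "w") as fh:
        fh.write("new, longer content")
    os.remove(b)
    with open(c, "w") as fh:
        fh.write("fresh")

    added, removed, changed = diff(saved, snapshot(root))
    report = [[os.path.basename(p) for p in group]
              for group in (added, removed, changed)]

print(report)  # [['c.txt'], ['b.txt'], ['a.txt']]
```

Comparing only size and modification time can miss an edit that keeps the size and lands within the timestamp granularity, so this is a heuristic rather than a proof of consistency.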

I'll have to answer that in a few hours if you don't mind, I have to get going now.

lelanthran · on Dec 21, 2023

> And how do you build the full index initially without recursively walking the filesystem? Otherwise you're not going to match Everything's performance on initial index creation.

I wasn't planning to; it's a once-off cost - the user experience while using any software isn't degraded by the installation time, is it?

> And regarding the second crucial question: How do you know that a file you saw the last time your app or daemon was running, hasn't been modified in the meantime?

It's a daemon. If it isn't running while the user is using their desktop system, then it's not working or the user has turned it off.

In any case, if a component of the software is not running, then the software is not running.

I mean, seriously, even during regular updates, daemons still run. Even during distro upgrades daemons are still running. The rare cases where files are removed/changed/moved while daemons are turned off are fractions of fractions of a percentage.

My desktop system currently has 2.5m files. There are maybe a dozen files which will be modified during a maintenance-mode bootup, which has happened exactly zero times in the last decade.

For a Linux desktop file finding utility, monitoring all file writes, moves and deletes pretty much puts you ahead of any game in town right now, right?

> You keep saying that, but you're also not giving an answer to how you're going to solve the two major and pretty much only issues.

Perfect is the enemy of good.

Issue 1 - Initial index creation: I will create the index during the s/ware installation process and never create it again unless it is missing. To speed the creation during installation, I will use the mlocate.db file if it is found.

Issue 2 - Files that are changed/moved/removed when daemon is turned off: I don't really care, mostly. Those files a) have such a small probability of both existing and being of interest to the desktop user that lottery jackpots have a higher chance of happening to the user, and b) After an MVP, if the userbase requests those files, I'll either hardcode their locations and always check only for those dozens of files that can possibly be changed when daemons are turned off, or allow the user to specify via configuration, the pathname patterns to always check.
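The "pathname patterns to always check" option in (b) could look something like the sketch below. The pattern list, the `needs_recheck` name and the example paths are all hypothetical; the matching uses `fnmatch.fnmatchcase` from the Python standard library, where `*` also matches `/`, so a pattern like `/etc/*` covers everything below `/etc`.

```python
from fnmatch import fnmatchcase

# Hypothetical configuration: locations re-checked with stat on every
# daemon start, because they may change while daemons are turned off.
ALWAYS_CHECK = ["/etc/*", "/usr/lib/*.so*", "/boot/*"]


def needs_recheck(path, patterns=ALWAYS_CHECK):
    """True if `path` matches any of the always-check patterns."""
    return any(fnmatchcase(path, pattern) for pattern in patterns)


print(needs_recheck("/etc/ssh/sshd_config"))                 # True
print(needs_recheck("/usr/lib/x86_64-linux-gnu/libc.so.6"))  # True
print(needs_recheck("/home/u/notes.txt"))                    # False
```

Only the paths that match would need a stat call at startup; everything else would be trusted as indexed, which is the accuracy-for-speed trade discussed in this thread.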

I believe that this is enough to satisfy my original claim[1] of "similar in performance and query capabilities"[2].

[1] https://news.ycombinator.com/item?id=38686022

[2] I don't recall making any claim along the lines of "walking the filesystem tree is never used".

wander_homer · on Dec 21, 2023

> I wasn't planning to; it's a once-off cost - the user experience while using any software isn't degraded by the installation time, is it?

This whole topic started with you claiming that you can even beat Everything in that regard, which is why I even got involved in that discussion.

Remember, your response to:

> A huge part of Everything's speed comes from reading the master file table that other people mentioned, so you would need a way to quickly read file table entries on linux.

Was

> Not a problem. And no, I'm not talking about inotify either, and I'll additionally index the contents of (text) files as well with a negligible additional performance hit. It can be done as fast as, or faster than, `Everything`.

And btw. indexing content will obviously only put you even further behind. The cost is not negligible.

> It's a daemon. If it isn't running while the user is using their desktop system, then it's not working or the user has turned it off.

> My desktop system currently has 2.5m files. There are maybe a dozen files which will be modified during a maintenance-mode bootup, which has happened exactly zero times in the last decade.

I have many users, myself included, who use things like shared filesystems, which get modified by multiple systems. And like I already said, modern Linux systems also perform all of their updates in such a maintenance mode. So your app will give thousands of false positives or miss thousands of files completely on those systems.

Sigh... So you're also not going to solve the second issue. I mean I clearly asked you these questions multiple times and I tried to make it clear, that this is where the problem is, to save you and me time, and still you kept it a secret up until now that you're not even attempting to fix those problems.

So I'll have to take back my claim: Under these circumstances I can't guarantee that you'll make a lot of donations, because your app won't do anything special compared to others.

> For a Linux desktop file finding utility, monitoring all file writes, moves and deletes pretty much puts you ahead of any game in town right now, right?

Well, kind of, but it's not particularly difficult to solve that issue. The dev versions of FSearch already can do that.

> Issue 1 - Initial index creation: I will create the index during the s/ware installation process and never create it again unless it is missing. To speed the creation during installation, I will use the mlocate.db file if it is found.

So you're doing exactly what everyone else is doing. You can also ignore the mlocate index, because it doesn't contain enough information (size, date modified, ... aren't indexed by it so you'd need to stat all of those files anyway).

> Issue 2 - Files that are changed/moved/removed when daemon is turned off: I don't really care, mostly. Those files a) have such a small probability of both existing and being of interest to the desktop user that lottery jackpots have a higher chance of happening to the user

Like I already said, you're ignoring the hard and important problem. That's fine, but you suggested otherwise and now you're again doing nothing out of the ordinary.

> I believe that this is enough to satisfy my original claim[1] of "similar in performance and query capabilities"[2].

Well, it depends, you're not going to beat Everything in the areas me and others care about and in an attempt to get anywhere near that, you're trading accuracy for speed (what makes Everything special is that it's both fast and accurate/reliable). That's fine, but this is nothing new or special, so I'm not really interested in that.