Locking without a timeout is indeed a non-starter in the majority of use cases; we are agreed there.
The critical point that users must understand is that it is impossible to guarantee that the RedLock client never holds its lease longer than the timeout. Compounding this problem, the longer you make your timeout to reduce the likelihood of this accidentally happening, the less responsive your system becomes during genuine client misbehaviour.
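To make the overlap concrete, here is a toy sketch. It is not RedLock and not Redis: `LeaseStore` and `simulate` are hypothetical names, the store is a single in-memory object, and time is a number advanced by hand. It only shows that a client which checks its lease once and then stalls for longer than the TTL can end up believing it holds the lock alongside a second client.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class LeaseStore:
    """Toy single-node lease store with a manually advanced clock (hypothetical)."""
    now: float = 0.0
    holder: Optional[str] = None
    expires_at: float = 0.0

    def acquire(self, client: str, ttl: float) -> bool:
        # The lease is free if nobody holds it or the previous lease has expired.
        if self.holder is None or self.now >= self.expires_at:
            self.holder = client
            self.expires_at = self.now + ttl
            return True
        return False


def simulate(ttl: float, pause: float) -> List[str]:
    """Return the clients that believe they hold the lock after A stalls."""
    store = LeaseStore()
    believers = []
    if store.acquire("A", ttl):
        believers.append("A")  # A checks once, then stalls (GC pause, slow I/O, ...)
    store.now += pause         # time passes while A is stalled
    if store.acquire("B", ttl):
        believers.append("B")  # B gets the lease only if A's lease has expired
    return believers


if __name__ == "__main__":
    print(simulate(ttl=10, pause=15))  # stall longer than the TTL
    print(simulate(ttl=10, pause=5))   # stall shorter than the TTL
```

Raising `ttl` makes the first case rarer, but it also means that if A has actually crashed, B waits that much longer before it can take over, which is the tension described above.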
In most real-world scenarios, the tradeoffs are a bit softer than what people in the formal world dictate (and in doing so they forced certain systems to become suboptimal for everything except failures, kicking them out of business...). A few examples:
1. E-commerce system where there is a limited number of items of the same kind and you don't want to oversell.
2. Hotel booking system where we don't want to reserve the same dates/rooms multiple times.
3. Online medical appointments system.
In all those systems, re-opening the item/date/... after some time is OK, even after one day. And if the lock hold time is not that long, but a much stricter compromise (also a reasonable choice in the spectrum), so that during edge-case failures three items may be sold when only two exist, orders can be cancelled.
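The "orders can be cancelled" part can be sketched as a small reconciliation step that runs after the fact. This is only an illustration under assumptions: `reconcile` is a hypothetical helper, orders are `(order_id, timestamp)` pairs, and "earliest order wins" is one possible policy among many.

```python
from typing import List, Tuple

Order = Tuple[str, float]  # (order_id, timestamp); a hypothetical shape


def reconcile(stock: int, orders: List[Order]) -> Tuple[List[str], List[str]]:
    """Keep the earliest `stock` orders and return the rest for cancellation."""
    by_time = sorted(orders, key=lambda order: order[1])
    kept = [order_id for order_id, _ in by_time[:stock]]
    cancelled = [order_id for order_id, _ in by_time[stock:]]
    return kept, cancelled


if __name__ == "__main__":
    # Two items in stock, three orders accepted during an edge-case failure.
    orders = [("o1", 3.0), ("o2", 1.0), ("o3", 2.0)]
    kept, cancelled = reconcile(stock=2, orders=orders)
    print(kept, cancelled)
```

With two items in stock and three accepted orders, the latest order is the one handed back for cancellation.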
So yes, there is a tension between timeout, race conditions, and recovery time, but in many systems using something like RedLock, both the development and the end-user experience can be improved with a high rate of success, and the occasional unhappy event can be handled. The algorithm is now very old and still used by many implementations, and as we speak problems are being solved with it in a straightforward way and with very good performance. Of course, the developers of the solution should be aware that there are tradeoffs between certain values: but when are distributed systems easy?
P.S. Why do 10 years of heavy usage count in the face of a blog post saying that you can't trust a system like that? Because even if DS issues emerge randomly and sporadically, in the long run systems that create real-world issues become known once they reach mass usage. A big enough user base is a continuous integration test big enough to detect when a solution has serious real-world issues. So of course RedLock users who pick short timeouts for tasks whose duration is very hard to predict will indeed run into known issues. But the other systemic failure modes described in the blog post are never mentioned by users AFAIK.
I feel like you're dancing around admitting the core issue that Martin points out - RedLock is not suitable for systems where correctness is paramount. It can get close, but it is not robust in all cases.
If you want to say "RedLock is correct a very high percentage of the time when lease timeouts are tuned for the workload", I would agree with you actually. I even possibly agree with the statements "most systems can tolerate unlikely correctness failures due to RedLock lease violations. Manual intervention is fine in those cases. RedLock may allow fast iteration times and is worth this cost". I just think it's important to be crystal clear on the guarantees RedLock provides.
I first read Martin's blog post and your response years ago when I worked at a company that was using RedLock despite it not being an appropriate tool. We had an outage caused by overlapping leases because the original implementor of the system didn't understand what Martin has pointed out from the RedLock documentation alone.
I've been a happy Redis user and fan of your work outside of this poor experience with RedLock, by the way. I greatly appreciate the hard work that has gone into making it a fantastic database.