How do you retrieve other people's compute time? Or do you have a special deal with each provider that allows you to do that? Or maybe I completely misunderstood the thing.
How does this design compare to using channels to send data to dedicated handlers? When using channels, I've found multiple issues:
(1) Web-shaped code that is often hard to follow
(2) Requires manually implementing message types that can then be converted to network-sendable messages
(3) Requires explicitly giving a transmitter to every interested/allowed entity
(4) You get a result if your channel message failed to transmit, but NOT if your message failed to transmit over the network
But besides that, it's pretty convenient. Let's say you have a ws_handler channel: you just send your data through it, and a dedicated handler somewhere sends that message if it's able to.
For (4), you can implement that by passing a channel along with the message, so the handler can send a result back. You can then block the sending side all the way to the callsite if you wish.
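A minimal sketch of that reply-channel idea, using only std::sync::mpsc. The names Outgoing, fake_network_send, spawn_ws_handler and send_and_wait are made up for illustration, and the "network" here is a stand-in that fails on an empty payload:

```rust
use std::sync::mpsc;
use std::thread;

/// Hypothetical message type: the payload plus a channel for reporting the outcome.
struct Outgoing {
    payload: String,
    reply: mpsc::Sender<Result<(), String>>,
}

/// Stand-in for the real network write (assumption): fails on an empty payload.
fn fake_network_send(payload: &str) -> Result<(), String> {
    if payload.is_empty() {
        Err("nothing to send".to_string())
    } else {
        Ok(())
    }
}

/// Spawns the dedicated handler thread and returns the transmitter for it.
fn spawn_ws_handler() -> mpsc::Sender<Outgoing> {
    let (tx, rx) = mpsc::channel::<Outgoing>();
    thread::spawn(move || {
        for msg in rx {
            let outcome = fake_network_send(&msg.payload);
            // The caller may have stopped waiting; ignore a closed reply channel.
            let _ = msg.reply.send(outcome);
        }
    });
    tx
}

/// Sends a payload and blocks until the handler reports the network result.
fn send_and_wait(ws_handler: &mpsc::Sender<Outgoing>, payload: &str) -> Result<(), String> {
    let (reply_tx, reply_rx) = mpsc::channel();
    ws_handler
        .send(Outgoing { payload: payload.to_string(), reply: reply_tx })
        .map_err(|_| "handler is gone".to_string())?;
    // Outer error: the handler dropped the reply channel. Inner result: the network outcome.
    reply_rx.recv().map_err(|_| "handler dropped the reply".to_string())?
}

fn main() {
    let ws_handler = spawn_ws_handler();
    println!("{:?}", send_and_wait(&ws_handler, "hello"));
    println!("{:?}", send_and_wait(&ws_handler, ""));
}
```

The caller now sees both failure modes from issue (4): the channel send failing and the network send failing. The cost is one extra channel per message.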
My feeling is that sans-IO is particularly useful for libraries, although it can be used for applications too. In a library, it means you don't force decisions about how I/O happens on your consumer, making the library strictly more useful. This is important for Rust because there's already a bunch of ecosystem fragmentation between sync and async IO (not to mention different async runtimes).
The line between applications and libraries is fairly blurry, isn't it? In my experience, most applications grow to the point where you have internal libraries or could at least split out one or more crates.
I would go as far as saying that whatever functionality your application provides, there is a core that can be modelled without depending on IO primitives.
In my eyes, an ideal library should not contain state, allocate internally (unless it is very obvious that it does), or manage processes. The application should do that, or provide primitives for doing it which the library can make use of. That makes applications and libraries very, very different in my mind.
The thing about state is a good point. With the sans-IO pattern we have inversion of IO and Time, but adding memory to that would be a nice improvement too.
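To make "inversion of IO and Time" concrete, here is a rough sketch of a keepalive state machine that never touches a socket or a clock. Keepalive, Action, handle_input and poll are hypothetical names picked for this example, not taken from any particular library:

```rust
use std::time::{Duration, Instant};

/// What the state machine asks its caller to do; the caller owns the socket and the clock.
#[derive(Debug, PartialEq)]
enum Action {
    SendPing,
    Wait(Duration),
}

/// Hypothetical keepalive logic with no sockets, timers, or runtime inside.
struct Keepalive {
    interval: Duration,
    last_activity: Instant,
}

impl Keepalive {
    fn new(interval: Duration, now: Instant) -> Self {
        Keepalive { interval, last_activity: now }
    }

    /// IO is inverted: the caller did the read and hands the bytes in.
    fn handle_input(&mut self, _bytes: &[u8], now: Instant) {
        self.last_activity = now;
    }

    /// Time is an input too: the caller says what time it is and learns what to do next.
    fn poll(&mut self, now: Instant) -> Action {
        let idle = now.duration_since(self.last_activity);
        if idle >= self.interval {
            self.last_activity = now;
            Action::SendPing
        } else {
            Action::Wait(self.interval - idle)
        }
    }
}

fn main() {
    let start = Instant::now();
    let mut ka = Keepalive::new(Duration::from_secs(30), start);
    // No sleeping needed: the clock is advanced by hand.
    println!("{:?}", ka.poll(start + Duration::from_secs(10)));
    println!("{:?}", ka.poll(start + Duration::from_secs(30)));
}
```

A blocking caller, an async caller, and a unit test can all drive the same struct. Memory is the one thing not inverted here: the struct still owns its own state.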
Those C libraries that have initializers which take ** and do the allocation for you drive me nuts! I’m sure there’s some good reason, but can’t you trust me to allocate for myself, you know?
Yes, true. The one difference might be that you don't expect other consumers with a different approach to IO to use your internal libraries, although it does help if you want to change that in the future, and the testability is still useful.
Channels work fine if you are happy for your software to have an actor-like design.
But as you say, it comes with problems: actors/channels can be disconnected, for example. You also want to make sure they are bounded; otherwise you don't have backpressure. Plus, they require copying, so achieving high throughput may be tricky.
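For illustration, a small sketch of both problems with the standard library's bounded channel. The helper offer is a made-up name; sync_channel, try_send and TrySendError are from std::sync::mpsc:

```rust
use std::sync::mpsc::{sync_channel, SyncSender, TrySendError};

/// Tries to enqueue without blocking and names the outcome.
fn offer(tx: &SyncSender<u32>, value: u32) -> &'static str {
    match tx.try_send(value) {
        Ok(()) => "queued",
        // Backpressure: the buffer is full, so the producer has to slow down or drop.
        Err(TrySendError::Full(_)) => "full",
        // The receiving side is gone; nobody will ever read this message.
        Err(TrySendError::Disconnected(_)) => "disconnected",
    }
}

fn main() {
    // Bounded channel with room for two messages.
    let (tx, rx) = sync_channel::<u32>(2);
    println!("{}", offer(&tx, 1));
    println!("{}", offer(&tx, 2));
    println!("{}", offer(&tx, 3));
    // Draining one message frees a slot again.
    println!("{:?}", rx.recv());
    println!("{}", offer(&tx, 3));
    // Dropping the receiver disconnects the channel.
    drop(rx);
    println!("{}", offer(&tx, 4));
}
```

With an unbounded mpsc::channel the "full" case never occurs, which is exactly the missing backpressure: a fast producer just grows the queue.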
To scrape the websites, do you just blindly cut all of the HTML into fixed-size chunks, or is there some more sophisticated logic to extract the text of interest?
I'm wondering because most news websites now have a lot of polluting elements like popups. Would they also go into the database?
If you look at the vector handler in his code, he is using the bluemonday sanitizer and doing some "replaceAll".
So I think there may be some useless data in the vector, but that may not be an issue since it is coming from multiple sources (for simple questions at least).