
Yo, the author here. Thanks for the feedback. I totally meant idempotency, drat. (On Hadoop, thanks to speculative execution of reduce tasks, you also have to worry a bit about reentrancy, but what I was talking about was indeed idempotency.)
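To illustrate the idempotency point: if a speculatively executed reduce task can run twice, its output write needs to be an upsert keyed on the reduce key, so a duplicate run leaves the datastore unchanged. A minimal sketch, with a plain dict standing in for the datastore (the names `store` and `save_result` are illustrative, not a real Hadoop API):

```python
def save_result(store: dict, key: str, value: int) -> None:
    # Upsert keyed on the reduce key: running this twice with the same
    # inputs leaves the store in exactly the same state as running it once.
    store[key] = value

store = {}
save_result(store, "user:42:clicks", 7)
save_result(store, "user:42:clicks", 7)  # speculative duplicate: harmless
```

An append (`store.setdefault(key, []).append(value)`) would not have this property, which is exactly how speculative re-execution ends up double-counting.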

Shutting down the pipeline: I hear you on prod/non-prod. In our setup, the pipeline writes to a datastore, so if we kill the pipeline, the datastore stays up; it just stops updating. That's working so far. We may end up flagging suspect data as you suggest instead of doing a full stop (or only doing a full stop if more than a very small percentage of the data is suspect).
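That "flag unless it's more than a small percentage" policy is simple to state as code. A hedged sketch (the function name and the 1% threshold are illustrative, not from the actual pipeline):

```python
def handle_suspect(total: int, suspect: int, max_fraction: float = 0.01) -> str:
    """Decide whether to halt the pipeline or just flag suspect records.

    Halt only when the suspect fraction exceeds max_fraction;
    otherwise keep running and flag the bad records.
    """
    if total and suspect / total > max_fraction:
        return "halt"
    return "flag"
```

So 5 suspect records out of 1000 would be flagged and the pipeline keeps updating the datastore, while 50 out of 1000 would trigger the full stop.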



No problem. I am not too familiar with Hadoop, but those speculative reduce tasks sound like a real blast to debug.

I can see why the approach in your blog would have a lot of appeal in that environment. It sounds like some sort of error flagging, in combination with a set of heuristics around what failed, how often, what time of day, etc., would be the way to go.
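Those heuristics (what failed, how often, what time of day) could be sketched as a small triage pass over failure records. This is purely illustrative; the function name, the repeat threshold of 3, and the off-hours window are assumptions, not anything from the thread:

```python
from collections import Counter
from datetime import datetime

def triage(failures: list[tuple[str, datetime]]) -> dict:
    """Summarize failure records as (task_name, timestamp) pairs.

    Flags tasks that failed repeatedly (3+ times) and counts failures
    that clustered in off-hours (before 06:00 or after 22:00).
    """
    counts = Counter(name for name, _ in failures)
    repeated = {name for name, n in counts.items() if n >= 3}
    off_hours = sum(1 for _, ts in failures if ts.hour < 6 or ts.hour >= 22)
    return {"repeated": repeated, "off_hours_count": off_hours}
```

A real monitoring system would feed this kind of summary into alerting rather than a dict, but the shape of the heuristics is the same.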

I find that intelligent monitoring systems like that are ultimately necessary in systems like this anyway; you just usually end up discovering that the hard way (I know I have, several times; it's one of those lessons you're tempted to unlearn in the interests of expediency). Does Hadoop help you out with that sort of thing?





