
This is why I am a big fan of running every piece of software under a separate user and setting ulimit to a low value, so that something stupid like this cannot impact the production service. I would be keen to replicate this scenario on my test cluster and see whether my settings catch it. Does anybody know if the software in question is an open-source tool?
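For anyone who wants to try the per-process limit idea, here is a minimal sketch, assuming Linux and Python's standard `resource` module. It caps a child's address space with `RLIMIT_AS` (the limit that `ulimit -v` sets) before the child executes. The helper name `run_with_memory_limit` and the byte values are hypothetical, chosen only for illustration, and this is not necessarily how the setup described above is configured.

```python
import resource
import subprocess
import sys


def run_with_memory_limit(argv, max_bytes):
    """Run argv as a child process with its address space capped at max_bytes."""

    def apply_limit():
        # Runs in the child after fork and before exec, so only the child
        # (and anything it spawns) inherits the lower limit.
        resource.setrlimit(resource.RLIMIT_AS, (max_bytes, max_bytes))

    return subprocess.run(argv, preexec_fn=apply_limit, capture_output=True)
```

Calling `run_with_memory_limit([sys.executable, "-c", "x = bytearray(1 << 31)"], 512 * 1024 * 1024)` should come back with a non-zero return code, because the 2 GiB allocation fails inside the child instead of pushing the whole machine into swap.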


This is the approach I take too. I'm also looking at disabling the OOM killer entirely because it seems pretty useless. Whenever I see something killed by the OOM killer, the culprit is usually, and obviously, some runaway Java process, but the OOM killer inevitably picks the SSH daemon to kill, which doesn't help anything, and the box continues to swap so badly that it seems unrecoverable. I'd rather have the box panic and reboot if it's truly out of memory.
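The "panic and reboot" behaviour maps onto two Linux sysctls: `vm.panic_on_oom` makes the kernel panic instead of picking a victim, and `kernel.panic` sets how many seconds to wait before rebooting after a panic. Below is a rough sketch that writes both through `/proc/sys`. The function names and the 10-second delay are assumptions for illustration, the `root` parameter exists only so the code can be tried against a scratch directory, and writing the real files requires root.

```python
from pathlib import Path


def set_sysctl(name, value, root="/proc/sys"):
    """Write a sysctl value, e.g. name='vm.panic_on_oom' -> <root>/vm/panic_on_oom."""
    path = Path(root) / name.replace(".", "/")
    path.write_text(f"{value}\n")


def get_sysctl(name, root="/proc/sys"):
    """Read a sysctl value back as a stripped string."""
    return (Path(root) / name.replace(".", "/")).read_text().strip()


def panic_and_reboot_on_oom(reboot_delay_s=10, root="/proc/sys"):
    """Panic on out-of-memory, then reboot reboot_delay_s seconds after the panic."""
    set_sysctl("vm.panic_on_oom", 1, root)
    set_sysctl("kernel.panic", reboot_delay_s, root)
```

Values written this way do not survive a reboot; persisting them would normally go through the distribution's sysctl configuration files.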


I have not looked into it at all, but can you not exempt sshd from the OOM killer?


I looked into it a little bit. There are ways to tune it but I didn't see a way to exempt processes by name. It may be possible.
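On the exemption question: the kernel exposes a per-process knob, `/proc/<pid>/oom_score_adj`, and writing `-1000` to it takes that process out of OOM-killer consideration. There is no by-name setting in the kernel as far as I know, but one can be approximated by scanning `/proc` for matching command names. The sketch below does that. The helper names are hypothetical, `proc_root` is a parameter only so the code can be exercised against a fake tree, and lowering the value on a real system needs root.

```python
from pathlib import Path


def pids_by_name(name, proc_root="/proc"):
    """Return sorted PIDs whose command name (the comm file) equals name."""
    pids = []
    for entry in Path(proc_root).iterdir():
        if not entry.name.isdigit():
            continue  # skip non-process entries such as 'self' or 'sys'
        try:
            comm = (entry / "comm").read_text().strip()
        except OSError:
            continue  # the process exited while we were scanning
        if comm == name:
            pids.append(int(entry.name))
    return sorted(pids)


def exempt_from_oom_killer(pid, proc_root="/proc"):
    """Write -1000 to oom_score_adj so the OOM killer never selects this PID."""
    (Path(proc_root) / str(pid) / "oom_score_adj").write_text("-1000\n")
```

Something like `for pid in pids_by_name("sshd"): exempt_from_oom_killer(pid)` would cover the SSH daemon. Note that the kernel truncates the `comm` name to 15 characters, so longer names need the truncated form.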

The scenario I described above involves HPC clusters in a university environment. The problem is students running poorly written programs. I'd rather reboot the node and tell them to fix their code than try to accommodate their careless or naive programming.



