But that’s just the marking phase, isn’t it? And most of it can be done fully in parallel, so while not all CPU cores can be maxed out with that, more often than not the original problem itself can be hard to parallelize to that level, so “wasting” a single core may very well be worth it.
Part of the problem though, is that the object graph walk pretty quickly is non contiguous, regardless of how it's laid out in memory.