Currently it's very difficult to reason about an object's geometric dimensions or movement from video alone. We (humans) are decent at it, and we still make mistakes. Finding "the right ML algorithm" for this is basically a problem as hard as achieving the singularity.