Global-to-Local Modeling for Video-based 3D Human Pose and Shape Estimation