In recent years, there has been an increased interest in point cloud representation for visualizing digital humans in cross reality. However, due to their voluminous size, point clouds require high bandwidth to be transmitted. In this paper, we propose a temporal interpolation architecture capable of increasing the temporal resolution of dynamic digital humans, represented using point clouds. With this technique, bandwidth savings can be achieved by transmitting dynamic point clouds in a lower temporal resolution, and recreating a higher temporal resolution on the receiving side. Our interpolation architecture works by first downsampling the point clouds to a lower spatial resolution, then estimating scene flow using a newly designed neural network architecture, and finally upsampling the result back to the original spatial resolution. To improve the smoothness of the results, we additionally apply a novel technique called neighbour snapping. To be able to train and test our newly designed network, we created a synthetic point cloud data set of animated human bodies. Results from the evaluation of our architecture through a small-scale user study show the benefits of our method with respect to the state of the art in scene flow estimation for point clouds. Moreover, correlation between our user study and existing objective quality metrics confirm the need for new metrics to accurately predict the visual quality of point cloud contents.