Many applications of quantum simulation require one to prepare and then characterize quantum states by efficiently estimating k-body reduced density matrices (k-RDMs), from which observables of interest may be obtained. For instance, the fermionic 2-RDM contains the energy, charge density, and energy gradients of an electronic system, while the qubit 2-RDM contains the spatial correlation functions of magnetic systems. Naive estimation of such RDMs requires repeated state preparations for each matrix element, which makes for prohibitively large computation times. However, commuting matrix elements may be measured simultaneously, allowing for a significant cost reduction. In this work, we design schemes for such a parallelization with near-optimal complexity in the system size N. We first describe a scheme to sample all elements of a qubit k-RDM using only O(3klogk-1N) unique measurement circuits, an exponential improvement over prior art. We then describe a scheme for sampling all elements of the fermionic 2-RDM using only O(N2) unique measurement circuits, each of which requires only a local O(N)-depth measurement circuit. We prove a lower bound of ω(ϵ-2Nk) on the number of state preparations, Clifford circuits, and measurement in the computational basis required to estimate all elements of a fermionic k-RDM, making our scheme for sampling the fermionic 2-RDM asymptotically optimal. We finally construct circuits to sample the expectation value of a linear combination of ω anticommuting two-body fermionic operators with only O(ω) gates on a linear array. These circuits allows for sampling any linear combination of fermionic 2-RDM elements in O(N4/ω) time, with a significantly lower measurement circuit complexity than prior art. Our results improve the viability of near-term quantum simulation of molecules and strongly correlated material systems.