In the defence and security domain, camera systems are widely used for surveillance. Their major advantage is that they provide high‐resolution imagery that is easy to interpret. However, camera systems and optical imagery have drawbacks, especially in military applications: image quality degrades in poor lighting, dust or smoke, and cameras do not provide range information. These drawbacks can be mitigated by exploiting the strengths of radar. Radar performance is largely maintained at night, in various weather conditions and in dust and smoke. Moreover, radar provides the distance to detected objects. Since the strengths and weaknesses of radar and camera systems appear complementary, a natural question is: can radar and camera systems learn from each other? Here, the potential of radar/video multimodal learning is evaluated for human activity classification. The novelty of this work is the use of radar spectrograms and related video frames for classification with a multimodal neural network. Radar spectrograms and video frames are both two‐dimensional images, but the information they contain is of a different nature. This approach was adopted to limit the required preprocessing load while maintaining the complementary nature of the sensor data.