Wide-area surveillance (WAS) requires high spatiotemporal resolution (HSTR) video for improved precision. As an alternative to expensive WAS systems, low-cost hybrid imaging systems can be used. This paper presents the use of multiple video feeds to generate HSTR video, as an extension of reference-based super resolution (RefSR). One feed captures video at high spatial resolution with a low frame rate (HSLF), while the other simultaneously captures the same scene at low spatial resolution with a high frame rate (LSHF). The goal is to create an HSTR video by fusing the HSLF and LSHF videos. We propose an end-to-end trainable deep network that performs optical flow (OF) estimation and frame reconstruction by combining inputs from both video feeds. The proposed architecture yields significant improvements over existing video frame interpolation and RefSR techniques in terms of PSNR and SSIM, and can be deployed on drones equipped with dual cameras.
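
The fusion idea behind the hybrid setup can be illustrated with a minimal, non-learned sketch: at time instants where an HSLF keyframe exists, the high-resolution frame is kept; at intermediate instants, the LSHF frame is upsampled as a stand-in for the network's learned reconstruction. All function names and the nearest-neighbour upsampling are illustrative assumptions, not the paper's actual architecture, which replaces the upsampling step with OF-guided deep reconstruction.

```python
import numpy as np

def upsample(frame, scale):
    # Nearest-neighbour upsampling as a placeholder for the learned
    # reconstruction performed by the proposed deep network.
    return np.kron(frame, np.ones((scale, scale), dtype=frame.dtype))

def fuse_hstr(hslf_frames, hslf_fps, lshf_frames, lshf_fps, scale):
    """Toy fusion of an HSLF feed (high-res, low fps) and an LSHF feed
    (low-res, high fps) into an HSTR sequence at the LSHF frame rate."""
    ratio = lshf_fps // hslf_fps  # LSHF frames per HSLF keyframe interval
    out = []
    for i, lo in enumerate(lshf_frames):
        if i % ratio == 0 and i // ratio < len(hslf_frames):
            out.append(hslf_frames[i // ratio])  # keyframe: keep high-res
        else:
            out.append(upsample(lo, scale))      # in-between: upsample LSHF
    return out

# Example: a 30 fps 4x4 HSLF feed fused with a 60 fps 2x2 LSHF feed
hslf = [np.full((4, 4), v, dtype=float) for v in (1.0, 3.0)]
lshf = [np.full((2, 2), v, dtype=float) for v in (1.0, 2.0, 3.0, 4.0)]
hstr = fuse_hstr(hslf, 30, lshf, 60, scale=2)
```

In the actual method, the in-between frames are instead reconstructed by warping the HSLF keyframes with optical flow estimated from the LSHF feed, so the high-frequency spatial detail of the HSLF feed is propagated to every output frame.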