This paper presents a computer vision perspective on the radar echo extrapolation task in precipitation nowcasting. Inspired by the success of convolutional recurrent neural network models in this domain, including convolutional LSTM, convolutional GRU and trajectory GRU, we designed a new sequence-to-sequence neural network structure to leverage these models in a realistic data context. In this design, we decreased the number of channels in the more abstract (deeper) recurrent layers rather than increasing it. We formulated the task as encoding five radar images and predicting ten steps ahead at the pixel level, and found that relying only on the common mean squared error can misguide training and give a misleading picture at test time. In particular, the image quality of the later predictions usually degraded rapidly. As a solution, we employed visual image quality assessment techniques, including Structural Similarity (SSIM) and multi-scale SSIM, to train our models. Experimental results show that our structure was more tolerant of increasing uncertainty in the data, and that the use of image quality metrics can significantly reduce blurring in the predicted images. Moreover, we found that training with SSIM was very effective, and that a combination of SSIM with mean squared error and mean absolute error yielded the best prediction quality.
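To make the idea of a combined loss concrete, the following is a minimal sketch and not the implementation used in this work. It assumes pixel intensities scaled to [0, 1], equal weights for the three terms, and a simplified single-window SSIM computed from whole-image statistics instead of the usual sliding-window form. The function names `global_ssim` and `combined_loss`, the weights, and the constants `k1` and `k2` (set to the values commonly used for SSIM) are illustrative assumptions.

```python
import numpy as np


def global_ssim(x, y, data_range=1.0, k1=0.01, k2=0.03):
    """Simplified SSIM using whole-image statistics (one window).

    Assumption: standard SSIM averages this quantity over local
    windows; a single global window keeps the sketch short.
    """
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    c1 = (k1 * data_range) ** 2  # stabilises the luminance term
    c2 = (k2 * data_range) ** 2  # stabilises the contrast/structure term
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    luminance = (2 * mu_x * mu_y + c1) / (mu_x ** 2 + mu_y ** 2 + c1)
    structure = (2 * cov_xy + c2) / (var_x + var_y + c2)
    return luminance * structure


def combined_loss(pred, target, w_ssim=1.0, w_mse=1.0, w_mae=1.0):
    """Weighted sum of (1 - SSIM), MSE and MAE.

    Assumption: the weights are hypothetical placeholders, not
    values taken from the paper.
    """
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    mse = np.mean((pred - target) ** 2)
    mae = np.mean(np.abs(pred - target))
    ssim_term = 1.0 - global_ssim(pred, target)
    return w_ssim * ssim_term + w_mse * mse + w_mae * mae


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    target = rng.random((16, 16))                    # stand-in for a radar frame
    blurry = np.full_like(target, target.mean())     # fully smoothed prediction
    print("MSE of blurry prediction:", np.mean((blurry - target) ** 2))
    print("SSIM of blurry prediction:", global_ssim(blurry, target))
    print("combined loss:", combined_loss(blurry, target))
```

In this toy case, a prediction that collapses to the mean intensity has the lowest mean squared error of any constant image, yet the structural term penalises it heavily because it carries no spatial structure. This is the kind of blurring that an error-only loss tends to leave unpunished.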