Recognition of continuous sign language is challenging because the number of words in a sentence and their boundaries are unknown during the recognition stage. This work proposes a two-stage solution in which the number of words in a sign language sentence is predicted in the first stage. The sentence is then temporally segmented accordingly, and each segment is represented as a single image using a novel solution that entails summation of frame differences with motion estimation and compensation. This results in a single-image representation per sign language word, referred to as a motion image. CNN transfer learning is used to convert each of these motion images into a feature vector, which is used for either model generation or sign language recognition. As such, two deep learning models are generated: one for predicting the number of words per sentence and the other for recognizing the meaning of the sign language sentences. The proposed approach of predicting the number of words per sentence and thereafter segmenting the sentence into equal segments worked well. Although each motion image can contain traces of the preceding or following words, this byproduct of the proposed solution is advantageous, as it places words in context, which helps explain the high sign language recognition rates reported. It is shown that bidirectional LSTM layers yield the most accurate models for both stages. In the experimental results section, an existing dataset containing 40 sentences generated from 80 sign language words is used. The experiments revealed that the proposed solution achieves word and sentence recognition rates of 97.3% and 92.6%, respectively. These represent percentage increases of 1.8% and 9.1%, respectively, over the best word and sentence recognition results reported in the literature for the same dataset.
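The motion-image construction described above can be illustrated with a minimal sketch. The snippet below collapses a clip of grayscale frames into one image by accumulating absolute frame differences; the full method additionally applies motion estimation and compensation before differencing, which is omitted here for brevity. The function name `motion_image` and the synthetic moving-square clip are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def motion_image(frames):
    """Collapse a clip into one image by summing absolute frame differences.

    Simplified sketch of the 'motion image' idea: the paper's method also
    uses motion estimation and compensation before differencing, which is
    omitted here. `frames` is assumed to have shape (T, H, W), grayscale.
    """
    frames = np.asarray(frames, dtype=np.float32)
    diffs = np.abs(np.diff(frames, axis=0))   # (T-1, H, W) frame differences
    acc = diffs.sum(axis=0)                   # accumulate motion over time
    # Normalize to [0, 255] so the result can serve as a CNN input image.
    if acc.max() > 0:
        acc = acc / acc.max() * 255.0
    return acc.astype(np.uint8)

# Synthetic clip: a bright square translating rightward across 8 frames.
clip = np.zeros((8, 32, 32), dtype=np.float32)
for t in range(8):
    clip[t, 12:20, 2 + 3 * t : 10 + 3 * t] = 255.0

mi = motion_image(clip)
print(mi.shape)  # (32, 32): one image summarizing the whole clip
```

In the full pipeline, each such image would be passed through a pretrained CNN to obtain a fixed-length feature vector for the downstream bidirectional LSTM models.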