Just in case you missed it: my friend Michael and I have been building a self-driving golf cart. For more information, please check out the link here. The code for this post is on Github.
Introduction

End to end deep learning has been popular among self-driving car researchers. In my previous posts [1, 2], I mentioned that Michael and I have been working hard to develop an end to end deep learning system for steering angle prediction. However, our algorithm has some flaws; in particular, the learning results were not very transferable. One intuitive shortcoming is that the deep learning model only considers a single frame as its input. Humans, by contrast, have a decent memory of what has happened in the recent past and use that memory to inform decisions about the future. The same logic applies to steering and controlling a vehicle: intuitively, the model should have an understanding of the recent past as well as of spatial features, in other words, a spatial-temporal understanding.
Eventually, with a new model proposed by DeepMind, I was able to achieve a 0.0530 validation RMSE, which would place me among the top five teams in the Udacity self-driving challenge #2.
A Trip Down Memory Lane…
In 1989, ALVINN, the self-driving car (truck) built by Dr. Dean Pomerleau and his team, drove around the Carnegie Mellon campus. According to Pomerleau, the vehicle was powered by a CPU slower than the one in an Apple Watch. The car used a fully connected neural network to predict the steering angle in real time. Fast forward nearly three decades: NVIDIA proposed a novel method that combines Pomerleau's idea with modern GPUs, giving NVIDIA's car the capability to perform accurate, real-time end to end steering prediction. Around the same time, Udacity held a challenge that asked researchers to create the best end to end steering prediction model. This project is deeply inspired by that competition, and the goal is to further the work in behavioral cloning for self-driving vehicles.
An Old Method — Convolutional LSTM
This architecture is nothing new in the machine learning world. A 2015 paper describes the technique as follows:
[CNN LSTMs are] a class of models that is both spatially and temporally deep, and has the flexibility to be applied to a variety of vision tasks involving sequential inputs and outputs
Essentially, the model takes advantage of the spatial feature extraction capabilities of CNNs and the temporal learning abilities of LSTMs to create a network that understands both spatial and temporal information. I created this basic model based on some available open source projects.
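To make the idea concrete, here is a simplified Keras-style sketch of a CNN + LSTM steering model. The layer sizes, input shape, and sequence length are illustrative and not the exact network I trained:

```python
# Simplified CNN + LSTM steering model (illustrative layer sizes,
# not the exact network used in these experiments).
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (TimeDistributed, Conv2D, Flatten,
                                     LSTM, Dense)

SEQ_LEN, HEIGHT, WIDTH, CHANNELS = 10, 66, 200, 3  # illustrative input shape

model = Sequential([
    # The CNN is applied to every frame in the sequence to extract spatial features.
    TimeDistributed(Conv2D(24, (5, 5), strides=2, activation='relu'),
                    input_shape=(SEQ_LEN, HEIGHT, WIDTH, CHANNELS)),
    TimeDistributed(Conv2D(36, (5, 5), strides=2, activation='relu')),
    TimeDistributed(Conv2D(48, (3, 3), strides=2, activation='relu')),
    TimeDistributed(Flatten()),
    # The LSTM aggregates the per-frame features over time.
    LSTM(64),
    Dense(50, activation='relu'),
    Dense(1)  # predicted steering angle
])

model.compile(optimizer='adam', loss='mse')
model.summary()
```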

LSTM Results
Despite this promising architecture, the neural network did not significantly outperform its single frame counterpart.
Validation loss is a standard way to benchmark deep learning models: predictions are made on data that the model has never seen before, and the goal is to minimize the loss value.
Instead of scoring 0.124 on the validation dataset, this new model scored 0.11: an improvement of barely 0.01 for such a drastically different model design. I repeated this experiment with augmentation to increase the variation in the dataset. Those techniques yielded similar results, with a validation loss of 0.11 – 0.13. After a couple of weeks, I realized that it was time to move on to a better architecture.
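For reference, the validation score reported here is a root-mean-square error over held-out steering angles. A minimal sketch of how such a score can be computed (the variable names in the usage comment are hypothetical):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root-mean-square error between ground-truth and predicted steering angles."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# e.g. rmse(val_angles, model.predict(val_frames)) -> the validation RMSE
```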
The New Method — Inflated 3D ConvNet
In mid-2017, Google DeepMind published a paper on video action recognition that introduced both a new dataset and a new architecture. From the paper:
We also introduce a new Two-Stream Inflated 3D ConvNet (I3D) that is based on 2D ConvNet inflation: filters and pooling kernels of very deep image classification ConvNets are expanded into 3D, making it possible to learn seamless spatio-temporal feature extractors from video while leveraging successful ImageNet architecture designs and even their parameters.
This architecture has proven to be effective for video analysis, making it a great candidate for end-to-end behavioral cloning. The paper proposes three networks: one using multiple frames of RGB images as input, one using multiple frames of optical flow as input, and a final one combining both streams. In my training, the first (RGB-only) variant offered the best balance of speed and accuracy, making it a viable solution for real-world applications. Here is an overview of the model.
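The key trick is "inflating" pretrained 2D filters into 3D. Here is a minimal sketch of the idea, assuming the usual repeat-and-rescale scheme; the function and shapes are illustrative, not DeepMind's implementation:

```python
# Sketch of "inflating" a 2D conv kernel into 3D by repeating it along
# the temporal dimension and rescaling (the core idea behind I3D).
import numpy as np

def inflate_2d_kernel(kernel_2d, time_depth):
    """kernel_2d: (kH, kW, in_ch, out_ch) -> (time_depth, kH, kW, in_ch, out_ch)."""
    kernel_3d = np.repeat(kernel_2d[np.newaxis, ...], time_depth, axis=0)
    return kernel_3d / time_depth  # rescale so activations keep the same magnitude

# e.g. a 7x7 ImageNet-pretrained filter becomes a 7x7x7 spatio-temporal filter
k2d = np.random.randn(7, 7, 3, 64)        # stand-in for pretrained 2D weights
k3d = inflate_2d_kernel(k2d, time_depth=7)
print(k3d.shape)                           # (7, 7, 7, 3, 64)
```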

I3D Results
This 3D ConvNet model delivered the best results. All of the models were evaluated with the same validation dataset and the same loss function, and they were trained for the same amount of time. The 64-frame input version cut the original loss in half, even landing a place among the top 5 teams of the Udacity self-driving challenge #2. Intuitively, more input frames give the network more information through time.
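For context, the 64-frame input simply means the network sees a sliding window of recent frames at each prediction step. A rough sketch of how such clips might be assembled; the shapes and stride are illustrative, not my exact pipeline:

```python
# Assemble overlapping 64-frame clips from a video stream so the 3D ConvNet
# sees a window of the recent past (illustrative shapes and stride).
import numpy as np

def make_clips(frames, clip_len=64, stride=8):
    """frames: (N, H, W, 3) array -> (num_clips, clip_len, H, W, 3)."""
    clips = [frames[i:i + clip_len]
             for i in range(0, len(frames) - clip_len + 1, stride)]
    return np.stack(clips)

video = np.zeros((100, 66, 200, 3), dtype=np.float32)  # dummy video
print(make_clips(video).shape)  # (5, 64, 66, 200, 3)
```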

Conclusion & Future Work
With this most recent development, I can improve the current steering system on the self-driving golf cart. I hope that these strong numerical results will translate into real-world improvements.
By no means is this work complete. My experimentation with ConvLSTM was very limited, and the number of input frames was much smaller than for the 3D ConvNet. That said, another clear advantage of the new model is its efficiency: it can run at up to 15 fps on an NVIDIA GTX 1080.
Behavioral cloning is not limited to steering. I have been researching collision prevention and speed estimation as well. (more on that soon…) I hope you have found this informative. If you are interested, the source code is available on Github. Please check it out. You can also reach out to me at contact@neilnie.com. Thanks for stopping by!
If you are interested in learning more about the self-driving golf cart project, you might enjoy the following posts.
- Deep learning steering prediction
- Semantic segmentation
- Robot Operating System