In case you missed it, my friend Michael and I have been building a self-driving golf cart. For more information, please check out the link here. The code for this post is on GitHub.



End-to-end deep learning has been popular among self-driving car researchers. In my previous posts [1, 2], I mentioned that Michael and I have been working hard to develop an end-to-end deep learning system for steering angle prediction. However, our algorithm had several flaws; in particular, the learning results did not transfer well. One obvious limitation is that the deep learning model only considers a single frame as its input. Humans, by contrast, keep a decent memory of the recent past and use it to inform decisions about the future. The same logic applies to steering and controlling a vehicle. Intuitively, the model should have an understanding of the recent past as well as the spatial features of the current scene: a spatial-temporal understanding.

Eventually, with a new model proposed by DeepMind, I was able to achieve a 0.0530 validation RMSE loss, which could land me among the top five teams of the Udacity self-driving challenge #2.

A Trip Down Memory Lane…

In 1989, ALVINN, the self-driving car (truck) made by Dr. Dean Pomerleau and his team, drove around the Carnegie Mellon campus. According to Pomerleau, the vehicle was powered by a CPU slower than the one in an Apple Watch. The car used a fully connected neural network to predict the steering angle in real time. Fast forward more than twenty-five years: NVIDIA proposed a novel method that combines Pomerleau's idea with the modern GPU, giving NVIDIA's car the capability to accurately perform real-time end-to-end steering prediction. Around the same time, Udacity held a challenge that asked researchers to create the best end-to-end steering prediction model. This project is deeply inspired by that competition, and the goal is to further the work in behavioral cloning for self-driving vehicles.

An Old Method — Convolutional LSTM

This architecture is nothing new in the machine learning world. A 2015 paper describes the technique as:

[CNN LSTMs are] a class of models that is both spatially and temporally deep, and has the flexibility to be applied to a variety of vision tasks involving sequential inputs and outputs

Essentially, the model combines the spatial feature extraction capabilities of CNNs with the temporal modeling abilities of LSTMs to create a network that understands both spatial and temporal information. I created this basic model based on some available open-source projects.
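To make the idea concrete, here is a minimal NumPy sketch of the CNN-LSTM pattern, not the actual Keras model from this project: a shared feature extractor (a single linear-plus-ReLU layer standing in for a real CNN) is applied to every frame, and an LSTM cell integrates the per-frame features over time into one steering angle. All shapes, weights, and names are illustrative assumptions.

```python
# Minimal sketch of the CNN-LSTM idea (hypothetical shapes and random weights).
import numpy as np

rng = np.random.default_rng(0)

def frame_features(frame, W):
    # Stand-in for a CNN: one linear projection of the flattened frame
    # followed by a ReLU. A real model would use stacked conv layers.
    return np.maximum(W @ frame.ravel(), 0.0)

def lstm_step(x, h, c, P):
    # Standard LSTM cell equations; P packs the four gate weight blocks.
    z = P["W"] @ np.concatenate([x, h]) + P["b"]
    f, i, o, g = np.split(z, 4)
    sig = lambda v: 1.0 / (1.0 + np.exp(-v))
    c = sig(f) * c + sig(i) * np.tanh(g)   # forget old state, write new state
    h = sig(o) * np.tanh(c)                # expose gated output
    return h, c

T, H, Wd, F, Hd = 8, 16, 16, 32, 24        # frames, height, width, feature dim, hidden dim
frames = rng.standard_normal((T, H, Wd))   # a toy clip of T grayscale frames
W_feat = rng.standard_normal((F, H * Wd)) * 0.01
P = {"W": rng.standard_normal((4 * Hd, F + Hd)) * 0.01, "b": np.zeros(4 * Hd)}
w_out = rng.standard_normal(Hd) * 0.01

h, c = np.zeros(Hd), np.zeros(Hd)
for t in range(T):                         # temporal loop over the clip
    h, c = lstm_step(frame_features(frames[t], W_feat), h, c, P)
steering = float(w_out @ h)                # single steering-angle output
print(steering)
```

In Keras, the same structure is typically expressed with `TimeDistributed` wrapping the convolutional layers, followed by an `LSTM` layer and a dense regression head.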

The ConvLSTM network architecture, generated using Keras.

LSTM Results

Despite this promising architecture, the neural network did not significantly outperform its single-frame counterpart.

Validation loss is a standard way to benchmark deep learning models: predictions are made on data that the model has never seen before, and the goal is to minimize the loss value.
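The RMSE metric referenced throughout this post can be sketched in a few lines; the predictions and targets below are made-up numbers, not values from the actual Udacity dataset.

```python
# Sketch of the RMSE validation metric used to compare the steering models
# (toy predictions/targets; real values come from the held-out set).
import numpy as np

def rmse(y_pred, y_true):
    diff = np.asarray(y_pred) - np.asarray(y_true)
    return float(np.sqrt(np.mean(diff ** 2)))

preds = [0.10, -0.05, 0.02, 0.00]   # hypothetical predicted steering angles
truth = [0.12, -0.01, 0.03, -0.02]  # hypothetical ground-truth angles
print(rmse(preds, truth))           # → 0.025
```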

Instead of scoring 0.124 on the validation dataset, this new model scored 0.11, a difference of roughly 0.01 for such a drastically different model design. I repeated this experiment with augmentation to increase the dataset variation. These techniques yielded similar results, with validation losses of 0.11 to 0.13. After a couple of weeks, I realized that it was time to move on to a better architecture.

The New Method — Inflated 3D ConvNet

In mid-2017, Google DeepMind published a paper on action recognition that introduced both a new video dataset and a new model. The paper states:

We also introduce a new Two-Stream Inflated 3D ConvNet (I3D) that is based on 2D ConvNet inflation: filters and pooling kernels of very deep image classification ConvNets are expanded into 3D, making it possible to learn seamless spatio-temporal feature extractors from video while leveraging successful ImageNet architecture designs and even their parameters.

This architecture has proven effective for video analysis, making it a great candidate for end-to-end behavioral cloning. The paper proposes three networks: one taking multiple frames of RGB images as input, one taking multiple frames of optical flow as input, and a final one combining the two. Training showed that the first architecture offers a good balance of speed and accuracy, making it a viable solution for real-world applications. Here is an overview of the model.
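The "inflation" trick the quote describes can be sketched directly: a pretrained 2D convolutional filter is repeated along a new time axis and divided by the temporal depth, so the resulting 3D filter initially reproduces the 2D network's response on a "boring" video of identical frames. The shapes below are illustrative, not taken from the actual I3D weights.

```python
# Sketch of I3D filter inflation: turn a (k, k, c_in, c_out) 2D kernel
# into a (t, k, k, c_in, c_out) 3D kernel with the same initial behavior.
import numpy as np

def inflate_2d_filter(w2d, t):
    # Repeat along a new leading time axis, scaled by 1/t so that
    # summing the 3D filter over time recovers the 2D weights.
    return np.repeat(w2d[np.newaxis] / t, t, axis=0)

w2d = np.random.default_rng(0).standard_normal((7, 7, 3, 64))  # toy 2D stem filter
w3d = inflate_2d_filter(w2d, t=7)
print(w3d.shape)                         # → (7, 7, 7, 3, 64)
# Sanity check: collapsing the time axis recovers the original 2D weights.
assert np.allclose(w3d.sum(axis=0), w2d)
```

This is what lets I3D bootstrap from ImageNet-pretrained 2D ConvNets instead of training spatio-temporal filters from scratch.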

Inflated 3D convolutional neural network architecture visualization.

i3D Results

This 3D ConvNet model delivered the best results. All of the models were evaluated using the same validation dataset and the same loss function, and they were trained for equal amounts of time. The 64-frame input version cut the original loss in half, even landing a place among the top five teams of the Udacity self-driving challenge #2. Intuitively, more input frames provide the network with more information through time.

The 64 frame i3D network significantly outperformed the traditional ConvNet model.

Conclusion & Future Work

With this most recent development, I can improve the current steering system on the self-driving golf cart. I hope these strong benchmark results will translate into real-world improvements.

By no means is this work complete. My experimentation with the ConvLSTM was very limited; its number of input frames was much smaller than the 3D ConvNet's. That said, another clear advantage of the new model is its efficiency: it can run at up to 15 fps on an NVIDIA GTX 1080.

Behavioral cloning is not limited to steering. I have been researching collision prevention and speed estimation as well (more on that soon). I hope you found this informative. If you are interested, the source code is available on GitHub; please check it out. You can also reach me at contact@neilnie.com. Thanks for stopping by!

If you are interested in learning more about the self-driving golf cart project, you might enjoy the following posts.

  1. Deep learning steering prediction
    1. Visualizing the Steering Model with Attention Maps
    2. Successfully Tested the Autonomous Steering System for the Self-Driving Golf Cart
    3. Predicting Steering Angle with Deep Learning — Part 2
    4. Predicting Steering Angle with Deep Learning — Part 1
  2. Semantic segmentation
    1. The Robustness of the Semantic Segmentation Network
    2. Autonomous Cruise Control System
    3. Understanding the World Through Semantic Segmentation
  3. Robot Operating System
    1. Hello, ROS
    2. Open Street Map with ROS
    3. Self-Driving Software + Carla Simulator
    4. GPS Localization with ROS, rviz, and OSM
Posted by: NeilNie

Student at Columbia University, School of Engineering and Applied Sciences. Previously a software engineering intern at Apple. More than six years of experience developing iOS and macOS applications. Experienced in electrical engineering and microcontrollers. From publishing several apps to presenting a TEDx talk on machine learning, I strive to use my knowledge, skills, and passion to positively impact the world.
