One of the most important aspects of a self-driving vehicle is to understand the world around us. We deeply believe that it’s possible to use camera and computer vision technology to achieve this, instead of using traditional radar or lidar. Once we can gain that understanding, we can program the vehicle to make decisions.
The algorithm behind this is semantic segmentation with Convolutional Neural Networks. This technology allows us to overlay masks on top of the image to understand where the road is, where the vehicles are, etc. In short, we are classifying every individual pixel of our input image.
Also, our approach is not an end to end one. Unlike the autonomous steering system, where the image input is directly mapped to a steering output. In semantic segmentation, we don’t directly map the visuals from the camera to any decisions.
The goal is to use the segmentation mask returned by the network to make some decision about the vehicle’s behavior.
About the Network:
Throughout the years, after the initial proposal of FCN (fully convolutional network), there are many different network architectures for image segmentation. Namely, SegNet and DeepLab. Despite their outstanding performance, those networks are not suited for our application with a very limited amount of processing power.
We are using ENet, a real-time image segmentation network, proposed by researchers at Purdue University and the University of Warsaw, Poland. Our implementation is done in Keras, as usual.
The benefit of ENet is that it can run at ~10fps on the Jetson. This is a huge benefit, especially in an application such as a self-driving cart. The network only has 300k parameters, but it performed very well on training.
Development & Training
We used the CityScapes dataset for training. Specifically, we used their extra_training data with extra_coarse labels. During training, we unsurprisingly realized that more data point will tremendously increase the performance.
However, we hit a couple of bumps during development.
- We decided to generate a CSV file with all the image paths in order to train the model. This way, the training generator stays simple.
- We also implemented helper methods to turn the group truth (label) image, (shape: width * height * 3), to a (width * height * the number of labels). In another word, each category, ex, human, cars, or road, will have its own layer in the output matrix.
- We also took out some of the unnecessary labels. It made training and inference a lot faster.
As you can see, after only a couple of hours of training, the network performed very well on the validation dataset.
We are eager to implement this system on the vehicle. We will also program some rules for the vehicle to follow, once it understands its surroundings. If you have any questions or comments, please contact me at firstname.lastname@example.org. Thank you!