Understanding the World Through Semantic Segmentation

One of the most important aspects of a self-driving vehicle is understanding the world around it. We believe this can be achieved with cameras and computer vision, instead of traditional radar or lidar. Once the vehicle gains that understanding, we can program it to make decisions.

The algorithm behind this is semantic segmentation with convolutional neural networks. This technique lets us overlay masks on top of the image to understand where the road is, where the vehicles are, and so on. In short, we are classifying every individual pixel of the input image.
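Classifying every pixel means the network outputs a class probability for each pixel, and taking the argmax over the class dimension yields a label map. A minimal NumPy sketch (the shapes and class count here are illustrative, not our network's actual dimensions):

```python
import numpy as np

# Hypothetical network output: per-pixel class probabilities
# for a 4x4 image with 3 classes (e.g., road, vehicle, background).
height, width, num_classes = 4, 4, 3
rng = np.random.default_rng(0)
probs = rng.random((height, width, num_classes))
probs /= probs.sum(axis=-1, keepdims=True)  # normalize like a softmax

# Semantic segmentation assigns each pixel its most likely class.
label_map = np.argmax(probs, axis=-1)  # shape: (height, width)
print(label_map.shape)  # (4, 4)
```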

Note that our approach is not end-to-end. Unlike an autonomous steering system, where the image input is mapped directly to a steering output, semantic segmentation does not map the camera's visuals directly to any decision.

An inference result from the segmentation network

The goal is to use the segmentation mask returned by the network to make some decision about the vehicle’s behavior.

About the Network

Since the initial proposal of the FCN (fully convolutional network), many network architectures for image segmentation have emerged, notably SegNet and DeepLab. Despite their outstanding performance, those networks are not suited for our application, which has a very limited amount of processing power.

We are using ENet, a real-time image segmentation network, proposed by researchers at Purdue University and the University of Warsaw, Poland. Our implementation is done in Keras, as usual.

The ENet architecture, taken directly from the ENet paper.

The benefit of ENet is that it can run at ~10 fps on the Jetson, which is huge in an application such as a self-driving cart. The network has only about 300k parameters, yet it performed very well in training.
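Part of what keeps ENet so small is its initial block, which the paper describes as a 3×3 convolution with 13 filters (stride 2) concatenated with a 2×2 max-pool of the 3-channel input, giving 16 feature maps at half resolution. A minimal Keras sketch of just that block (the 512×512 input size is an illustrative assumption):

```python
from tensorflow.keras import layers, Input, Model

def enet_initial_block(inp):
    # 13 conv filters (stride 2) concatenated with a 2x2 max-pool of
    # the 3-channel input yields 16 feature maps at half resolution.
    conv = layers.Conv2D(13, (3, 3), strides=2, padding="same")(inp)
    pool = layers.MaxPooling2D(pool_size=(2, 2))(inp)
    return layers.Concatenate(axis=-1)([conv, pool])

inp = Input(shape=(512, 512, 3))
out = enet_initial_block(inp)
model = Model(inp, out)
print(model.output_shape)  # (None, 256, 256, 16)
```

Downsampling this aggressively at the very first layer is one of the design choices that makes real-time inference feasible on embedded hardware.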

Development & Training

We used the CityScapes dataset for training. Specifically, we used their extra_training data with extra_coarse labels. During training, we unsurprisingly found that more data points tremendously increase performance.

However, we hit a couple of bumps during development.

  1. We decided to generate a CSV file with all the image paths in order to train the model. This way, the training generator stays simple.
  2. We also implemented helper methods to turn the ground-truth (label) image (shape: width × height × 3) into a (width × height × number of labels) tensor. In other words, each category (e.g., humans, cars, or roads) gets its own layer in the output matrix.
  3. We also removed some unnecessary labels, which made training and inference a lot faster.
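The ground-truth conversion in step 2 can be sketched as a color-to-one-hot lookup. This is a minimal illustration, not our actual helper; the color palette below is an assumption for the example (CityScapes defines its own label colors):

```python
import numpy as np

# Illustrative color-to-class palette; CityScapes uses its own colors.
PALETTE = {
    (128, 64, 128): 0,  # road
    (220, 20, 60): 1,   # human
    (0, 0, 142): 2,     # vehicle
}

def label_to_one_hot(label_img: np.ndarray, palette=PALETTE) -> np.ndarray:
    """Turn a (H, W, 3) color label image into a (H, W, num_labels)
    tensor, with one binary layer per category."""
    h, w, _ = label_img.shape
    one_hot = np.zeros((h, w, len(palette)), dtype=np.float32)
    for color, idx in palette.items():
        mask = np.all(label_img == np.array(color), axis=-1)
        one_hot[:, :, idx] = mask
    return one_hot

# A 1x2 image: one road pixel, one vehicle pixel.
img = np.array([[[128, 64, 128], [0, 0, 142]]], dtype=np.uint8)
print(label_to_one_hot(img)[0, 0])  # [1. 0. 0.]
print(label_to_one_hot(img)[0, 1])  # [0. 0. 1.]
```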

Figure Legend

Green: plants

Blue: vehicles/obstacles

Purple: road

Red: humans

Visualization on the CityScapes dataset

As you can see, after only a couple of hours of training, the network performed very well on the validation dataset.

An example of a training label from the CityScapes dataset. Note: this is the fine annotation, not the coarse annotation.


We are eager to implement this system on the vehicle. We will also program some rules for the vehicle to follow once it understands its surroundings. If you have any questions or comments, please feel free to contact me. Thank you!

Posted by: NeilNie

Student at Columbia University, School of Engineering and Applied Sciences. Previously a software engineering intern at Apple. More than six years of experience developing iOS and macOS applications, plus experience in electrical engineering and microcontrollers. From publishing several apps to presenting a TEDx talk on machine learning, I strive to use my knowledge, skills, and passion to positively impact the world.
