Vision-based off-road navigation with geographical hints

Summary

Company name	Milrem Robotics
Project Manager	Meelis Leib
Systems Architect	Erik Ilbis

Company name	Autonomous Driving Lab, Institute of Computer Science, University of Tartu
Team lead	Tambet Matiisen
Data collection	Kertu Toompea
Model training	Romet Aidla
Robot integration	Anish Shrestha
Map preparation	Edgar Sepp

Objectives of the Demonstration Project

The goal of the project is to collect and validate a dataset for vision-based off-road navigation with geographical hints.

The Milrem UGV must be able to navigate:

in an unstructured environment (no buildings, roads or other landmarks),
with passive sensors (using only a camera and GNSS; active sensors make the UGV discoverable),
with no prior map or with an outdated map,
with unreliable satellite positioning signals.

A system that satisfies the above goals was proposed in the ViKiNG paper by Dhruv Shah and Sergey Levine from the University of California, Berkeley. The paper demonstrated vision-based kilometer-scale navigation with geographical hints in semi-structured urban environments, including parks. The goal of this project was to extend the ViKiNG solution to unstructured off-road environments, for example forests.

Examples of the desired environment:

Activities and results of demonstration project

Challenge addressed

The goal of using passive sensors means that the camera is the primary sensor. The currently best known way to make sense of camera images is to use artificial neural networks. These networks need a lot of training data to work well. Therefore the main goal of this project was to collect and validate the data to train artificial neural networks for vision-based navigation.

We set ourselves a goal to collect 50 hours of data consisting of 150 km of trajectories. This was inspired by the ViKiNG paper having 42 hours of training data. The time goal was achieved; distance-wise, 104 km was collected.

In addition to collecting the data, we wanted to validate whether it is usable for training the neural networks. We went further than that by not only training the networks, but also implementing a proof-of-concept navigation system on a Jackal robot.

Data sources

The data was collected between April 12th and October 6th, 2023, at 27 orienteering events and 20 self-guided sessions around Tartu, Estonia. Details of the places and weather conditions can be found in this table.

Data collection was performed with a golf trolley fitted with the following sensors:

ZED 2i stereo camera
Xsens MTI-710G GNSS/INS device
3x GoPro cameras at three different heights

Four different types of data were collected:

camera images,
visual odometry (trajectories derived from camera movement),
GPS trajectories,
georeferenced maps.

The following types of maps were acquired and georeferenced:

Map type	Example image
orienteering maps (usually from organizers, sometimes from Estonian O-Map)
Estonian base map (from Estonian Land Board)
Estonian base map with elevation (from Estonian Land Board)
Estonian orthophoto (from Estonian Land Board)
Google satellite photo (from Google Maps Static API)
Google road map (from Google Maps Static API)
Google hybrid map (from Google Maps Static API)

Further cleaning was applied to the data, with the following sections removed:

Missing odometry data
Big change in position: >1.0 m per timestep
Low velocity: <0.05 m/s
High velocity: >2.5 m/s
Model prediction errors were analyzed
Bad trajectories
Missing or bad camera images

Altogether this resulted in 94.4 km of trajectories used for training.
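
The threshold-based part of the cleaning can be sketched as below. This is a minimal illustration, not the project's actual cleaning code: the sample layout `(timestamp, x, y)`, the use of `None` for missing odometry, the rejection of non-increasing timestamps and the name `keep_mask` are all assumptions. Only the three numeric thresholds come from the list above.

```python
import math
from typing import List, Optional, Tuple

# Thresholds taken from the cleaning list above.
MAX_STEP_M = 1.0       # "big change in position" per timestep
MIN_SPEED_MPS = 0.05   # "low velocity"
MAX_SPEED_MPS = 2.5    # "high velocity"

# Assumed layout: (timestamp in s, x in m, y in m); None marks missing odometry.
Sample = Optional[Tuple[float, float, float]]


def keep_mask(samples: List[Sample]) -> List[bool]:
    """One flag per sample: True if the step leading to it passes all checks."""
    mask = [False] * len(samples)  # the first sample has no preceding step
    for i in range(1, len(samples)):
        prev, cur = samples[i - 1], samples[i]
        if prev is None or cur is None:
            continue  # missing odometry data
        dt = cur[0] - prev[0]
        if dt <= 0:
            continue  # assumption: non-increasing timestamps are rejected too
        step = math.hypot(cur[1] - prev[1], cur[2] - prev[2])
        speed = step / dt
        mask[i] = step <= MAX_STEP_M and MIN_SPEED_MPS <= speed <= MAX_SPEED_MPS
    return mask


track = [
    (0.00, 0.00, 0.0),
    (0.25, 0.25, 0.0),  # 1.0 m/s, kept
    (0.50, 0.25, 0.0),  # standing still, dropped
    None,               # missing odometry, dropped
    (1.00, 0.50, 0.0),  # previous sample missing, dropped
    (1.25, 2.00, 0.0),  # 1.5 m jump, dropped
    (1.50, 2.30, 0.0),  # 1.2 m/s, kept
]
print(keep_mask(track))
```

The manual steps in the list (bad trajectories, bad camera images) are not covered by this sketch.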

In addition, the dataset for the local planner was combined with the RECON dataset of 40 hours of autonomously collected trajectories.

Description of AI technology

The system makes use of two neural networks: a local planner and a global planner.

The local planner takes a camera image and predicts the next waypoints, where the robot can drive without hitting obstacles.

Inputs to the model	Outputs of the model
Current camera image; past 5 camera images for context; goal image	Trajectory of 5 waypoints; temporal distance to the goal

The local planner is trained using camera images and visual odometry. The goal image is an image taken at a fixed timestep offset in the future. The temporal distance to the goal is the number of timesteps between the current image and the goal image.
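
As an illustration of how one training example could be assembled from a recorded session, the sketch below returns frame indices and the temporal-distance label for a timestep `t`. The function name `make_sample`, the dictionary keys and the choice of the next 5 consecutive odometry poses as waypoint targets are assumptions made for this example; the source only fixes the 5 context images, the 5 waypoints and the goal image taken from the future.

```python
from typing import Dict, List, Union

CONTEXT = 5      # past camera images used as context (from the table above)
N_WAYPOINTS = 5  # waypoints in the predicted trajectory (from the table above)


def make_sample(num_frames: int, t: int, goal_offset: int) -> Dict[str, Union[int, List[int]]]:
    """Frame indices and labels for one training example at timestep t."""
    if goal_offset < 1:
        raise ValueError("the goal image must lie in the future")
    if t - CONTEXT < 0 or t + max(goal_offset, N_WAYPOINTS) >= num_frames:
        raise IndexError("not enough frames around t")
    return {
        "context": list(range(t - CONTEXT, t)),   # past 5 images
        "observation": t,                         # current image
        "goal": t + goal_offset,                  # image from the future
        # Assumption: targets are the next 5 visual-odometry poses.
        "waypoints": list(range(t + 1, t + 1 + N_WAYPOINTS)),
        "temporal_distance": goal_offset,         # label: timesteps to the goal
    }


sample = make_sample(num_frames=100, t=10, goal_offset=20)
print(sample["goal"], sample["temporal_distance"])
```

In practice the waypoint poses would also have to be expressed relative to the robot pose at `t`; that transformation is left out here.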

The global planner takes the waypoints proposed by the local planner and estimates which of them are likely on the path to the final goal.

Inputs to the model	Outputs of the model
Overhead map; current location; goal location	Probabilities whether each map pixel is on the path from current location to goal

The global planner is trained using georeferenced maps and GPS trajectories: given two points on a trajectory, all points in between are marked as high-probability points.
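
A minimal sketch of this labelling step is shown below. It assumes the GPS trajectory has already been projected to `(row, col)` pixel coordinates of the map crop and that the target is a plain binary mask; the function name `path_label` and the handling of points outside the crop are illustrative assumptions, not the project's implementation.

```python
from typing import List, Tuple

Pixel = Tuple[int, int]  # (row, col) in the map image


def path_label(height: int, width: int, track: List[Pixel], i: int, j: int) -> List[List[int]]:
    """Binary target: 1 on pixels the recorded track visits between indices i and j."""
    if not 0 <= i <= j < len(track):
        raise ValueError("need 0 <= i <= j < len(track)")
    label = [[0] * width for _ in range(height)]
    for row, col in track[i : j + 1]:
        if 0 <= row < height and 0 <= col < width:  # skip points outside the crop
            label[row][col] = 1
    return label


track = [(0, 0), (1, 1), (2, 2), (3, 3)]
label = path_label(4, 4, track, i=1, j=2)  # start at track[1], goal at track[2]
for row in label:
    print(row)
```

Here `track[i]` plays the role of the current location and `track[j]` the goal location given to the model as inputs.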

These two models work in coordination to handle outdated maps and inaccurate GPS:

as long as the local planner proposes valid waypoints, the robot does not collide with obstacles,
as the global planner picks waypoints that are on the path to the final destination, the robot tends to move towards the final goal, even if the GPS positioning is wrong or the map is outdated.
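
The interaction between the two planners can be sketched as a simple selection step: each candidate waypoint from the local planner is looked up in the global planner's probability map and the most probable one is followed. The sketch assumes the candidates have already been projected from the robot frame to map pixels using the current position estimate; the name `pick_waypoint` and the zero score for off-map candidates are assumptions.

```python
from typing import List, Sequence, Tuple

Pixel = Tuple[int, int]  # (row, col) in the probability map


def pick_waypoint(prob_map: Sequence[Sequence[float]], candidates: List[Pixel]) -> int:
    """Index of the candidate whose map pixel has the highest path probability."""
    if not candidates:
        raise ValueError("no candidate waypoints")
    h, w = len(prob_map), len(prob_map[0])

    def score(p: Pixel) -> float:
        row, col = p
        # Assumption: candidates that fall outside the map score zero.
        return prob_map[row][col] if 0 <= row < h and 0 <= col < w else 0.0

    return max(range(len(candidates)), key=lambda k: score(candidates[k]))


prob_map = [
    [0.1, 0.2, 0.1],
    [0.1, 0.9, 0.3],
    [0.0, 0.0, 0.0],
]
candidates = [(0, 0), (1, 1), (1, 2), (5, 5)]
print(pick_waypoint(prob_map, candidates))
```

Because every candidate comes from the local planner, the chosen waypoint is one the camera-based model considers drivable, regardless of what the map shows.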

Results of validation

Local planner

For the local planner, the following network architectures were considered:

Model	Pretrained weights	Trained or finetuned	On-policy tested	Generative	Waypoint proposal method
VAE	-	+	+	+	Sampling from latent representation
GNM	+	+	+	-	Cropping the current observation
ViNT	+	-	+	+	Goal image diffusion
NoMaD	+	-	-	+	Trajectory diffusion

The VAE model was trained from scratch; all other models were used with pre-trained weights from the Berkeley group. The GNM model was additionally fine-tuned on our own dataset.

The models were tested both off-policy and on-policy. Off-policy means that the model was applied to recorded data and its predicted actions were only visualized, not actuated. On-policy means that the model's predicted actions were actually actuated on the robot.

For on-policy testing we recorded a fixed route, took goal images at fixed intervals and measured the success rate in navigating to every goal image along the route. This shows how well the model understands the direction of the goal image and how well it can detect that the goal was reached. The operator intervened when the robot was going completely off the path and guided it back to the track. Sometimes the robot failed to detect a goal but kept driving in the right direction and successfully recognized the subsequent goal. In that case the missed goal was not marked as achieved, but no intervention was necessary.
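
The two numbers reported per run (success rate and number of interventions) can be computed from a per-goal log as sketched below. The `GoalOutcome` record and the example run are hypothetical and do not reproduce any of the recorded runs. Interventions are counted per goal as an integer, because a single goal may need more than one.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class GoalOutcome:
    reached: bool           # the model reported this goal as reached
    interventions: int = 0  # operator take-overs on the way to this goal


def summarize(run: List[GoalOutcome]) -> Tuple[float, int]:
    """Return (success rate in percent, total number of interventions)."""
    if not run:
        raise ValueError("empty run")
    reached = sum(o.reached for o in run)
    interventions = sum(o.interventions for o in run)
    return round(100.0 * reached / len(run), 2), interventions


# Hypothetical run with 20 goals.
run = (
    [GoalOutcome(True)] * 15        # reached without help
    + [GoalOutcome(True, 1)] * 2    # reached after one intervention each
    + [GoalOutcome(False)] * 3      # goal missed, but no intervention needed
)
print(summarize(run))
```

Keeping the two measures separate matters: a run can reach every goal and still be unusable if most goals required the operator's help.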

Off-policy results

The videos below show models applied to pre-recorded data. In the videos the green trajectory represents ground truth, the red trajectory represents the goal-conditioned predicted trajectory (many in the case of NoMaD) and blue represents sampled possible trajectories (in the case of VAE).

Model	Video
VAE
GNM finetuned
ViNT
NoMaD with goal images at fixed intervals
NoMaD with one fixed goal (exploratory mode)
NoMaD orienteering

Comments:

VAE prefers going straight, probably because the training dataset is too homogeneous. GNM and ViNT turn slightly less than the ground truth trajectory, but that is not necessarily a problem when running the models on-policy. NoMaD seems to turn the most.
GNM and VAE are trained with time-interval trajectories that shorten close to a goal or obstacle. ViNT and NoMaD seem to be trained with distance-interval trajectories that do not shorten. The predicted distance shortens near the goal with all models.
VAE and NoMaD can directly produce multiple candidate trajectories. VAE trajectories are conditioned only on the observation; NoMaD trajectories are additionally conditioned on the goal. NoMaD trajectories show some multi-modal behavior (passing a tree on both sides).
For GNM and ViNT the only way to generate multiple trajectories is to use different goal images. The image diffusion approach used in the ViNT paper seemed overkill to us, so we experimented instead with using crops of the observation image as goals. Some examples can be seen below in the "Putting it all together" section.
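
One way to generate such crops is sketched below: a few equally sized boxes slide from the left edge of the observation to the right edge, and each cropped region would then be resized to the model's input size and fed in as a goal image. The number of crops, their relative size and their vertical centring are illustrative assumptions; the source does not specify them. The box order `(left, top, right, bottom)` matches what Pillow's `Image.crop` expects.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def goal_crops(width: int, height: int, n: int = 3, frac: float = 0.5) -> List[Box]:
    """n equally sized crop boxes sliding from the left edge to the right edge."""
    if n < 1 or not 0.0 < frac <= 1.0:
        raise ValueError("need n >= 1 and 0 < frac <= 1")
    cw, ch = int(width * frac), int(height * frac)
    top = (height - ch) // 2  # assumption: crops are vertically centred
    boxes = []
    for k in range(n):
        # Spread the left edges evenly; a single crop is horizontally centred.
        left = (width - cw) * k // (n - 1) if n > 1 else (width - cw) // 2
        boxes.append((left, top, left + cw, top + ch))
    return boxes


for box in goal_crops(640, 480):
    print(box)
```

A crop on the left of the image asks the model for a trajectory that heads left, a crop on the right for one that heads right, which gives a set of candidate trajectories without a generative model.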

On-policy results indoors

We recorded a fixed route in the Delta office with goal images every 1 or 2 meters and measured the goal success rate for each interval.

Model	Goal interval	Number of goal images	Number of interventions	Success rate (%)	Video
GNM	1 m	30	0	90.00	video
GNM finetuned	1 m	30	1	93.33	video
ViNT	1 m	30	2	96.67	video
GNM	2 m	15	0	86.67	video
GNM finetuned	2 m	15	0	93.33	video
ViNT	2 m	15	0	93.33	video

Comments:

VAE cannot be used indoors, because its training data did not include indoor scenes.
NoMaD integration with the robot was not finished by the time of testing.
Using intervals larger than 2 m indoors was pointless: because of sharp turns, the goal would not be within line of sight.
GNM finetuned did better than vanilla GNM, possibly because the finetuning dataset was recorded with the same camera that was used during testing.

Example video of top-performing model (GNM finetuned) at 4X speed:

On-policy results outdoors

We recorded a fixed route in Delta park with goal images every 2, 5 or 10 meters and measured the goal success rate for each interval.

Model	Goal interval	Number of goal images	Number of interventions	Success rate (%)	Video
GNM	2 m	38	1	86.84	video
GNM finetuned	2 m	38	0	81.58	video
GNM finetuned	5 m	17	7	100.00	video
ViNT	5 m	17	7	100.00	video
ViNT	10 m	8	9	100.00	video

Comments:

Goal recognition was slightly more reliable with vanilla GNM than with GNM finetuned, but GNM finetuned was more reliable at staying on the track.
Although the models achieved a 100% success rate at 5 m intervals, this came at the expense of 7 interventions over 17 goals: for nearly half of the goals the operator had to help the robot reach it.
At 10 m intervals the only tested model, ViNT, was basically useless: with 9 interventions for 8 goals, the operator had to help the robot on practically every goal.
Goal distances predicted outdoors were in general longer than distances predicted indoors, so we had to use different "goal achieved" thresholds indoors and outdoors.
NoMaD integration with the robot was not finished by the time of testing.
VAE did not perform well enough to be included in the table. It tended to go straight all the time.

Example video of top-performing model (GNM finetuned) at 4X speed:

Summary of local planner results

Model	Goal following	Turning	Obstacle avoidance	Trail following	Trajectory diversity	Trajectory multi-modality
VAE	limited	poor	limited	good	limited	poor
GNM finetuned	good	limited	good	limited	-	-
ViNT	good	limited	limited	limited	-	-
NoMaD	good	good	good	good	good	limited

Global planner

For the global planner, the following network architectures were considered: a contrastive MLP (as in the original ViKiNG paper) and a U-Net.

As the U-Net approach worked much better, the contrastive approach was abandoned. Most of the experimentation was done with the base map with elevation.

The following videos show an on-policy simulation where the robot proposes a number of random waypoints and then moves towards the one that has the highest probability. The blue dot shows the robot's current location and the yellow dot is the goal location.

Location	Video
Ihaste
Kärgandi, sticking to the road
Annelinn, avoidance of houses

The following videos show different behavior for different map modalities.

Map modality	Video
Base map - sticking to the road
Road map - going straight (not enough context)
Orthophoto - mostly sticking to the road

Putting it all together

The following video shows an off-policy evaluation of the whole system on a recorded session. Colored trajectories are produced with crops of the original camera image used as goals, as shown in the video. The white trajectory comes from the final goal.

On-policy evaluation of the whole system was not possible, partly because of technical difficulties with the GNSS sensor and partly because of winter: the models were trained mainly on summer data, so running them in winter conditions would have been pointless.

Technical architecture

For the local planner, the following network architectures were tried:

VAE (as in the original ViKiNG paper)
GNM
ViNT
NoMaD

For the global planner, the following network architectures were tried:

contrastive MLP (as in the original ViKiNG paper)
U-Net

Potential areas of use

A working solution could be used in any area that needs navigation in unstructured environments with a poor GPS signal and outdated maps, for example:

military,
agriculture,
forestry,
rescue.

The dataset collected in this project can also be used to create a visual navigation benchmark and an international robot orienteering competition. Such a competition would make novel solutions and international talent accessible to Milrem Robotics.

Lessons learned

For training the local planner, the dataset seemed insufficient or contained trajectories that were too simple (moving mostly forward). Even after combining our data with the RECON dataset or fine-tuning existing models, the results were inconclusive: sometimes the fine-tuned model performed better than the original, sometimes worse. The original general navigation models were also unreliable and were not always able to avoid obstacles. More work is needed to make visual navigation reliable.

Alternative model outputs could be considered, e.g. predicting free space instead of trajectories and proposing waypoints from that free space. Collecting more exploratory data directly with the robot might also be necessary: the ViKiNG paper used mainly automatically collected exploratory data (30 hours) and relatively few expert trajectories (12 hours), whereas all of our data consisted of expert trajectories.

The global planner trained much better and was able to estimate the recommended path between two points reasonably well. We also observed different behavior for different map modalities, e.g. base map and road map. More work is needed to reduce the artifacts produced by the fully convolutional network, and some map modalities might need further tuning.

Final takeaways:

Training neural networks in 2023 is still hard.
Dataset curation is non-trivial and less documented than model training.
Use (or fine-tune) pre-trained models whenever they are available.
Off-policy performance (on recordings) does not match on-policy performance (on robot).

Description of User Interface

The screen shows the current camera image and the proposed trajectories. The white trajectory is the one induced by the goal image at the top right.
The bottom right shows the probability map (the path from the current position to the goal) and the original map. Waypoint colors match the trajectory colors.
The left pane shows the robot command.
