UT-ADL/milrem_visual_offroad_navigation

Vision-based off-road navigation with geographical hints

Summary

| Role | Name |
|---|---|
| Company name | Milrem Robotics |
| Project Manager | Meelis Leib |
| Systems Architect | Erik Ilbis |
| Company name | Autonomous Driving Lab, Institute of Computer Science, University of Tartu |
| Team lead | Tambet Matiisen |
| Data collection | Kertu Toompea |
| Model training | Romet Aidla |
| Robot integration | Anish Shrestha |
| Map preparation | Edgar Sepp |

Objectives of the Demonstration Project

The goal of the project is to collect and validate a dataset for vision-based off-road navigation with geographical hints.

The Milrem UGV must be able to navigate:

  • in an unstructured environment (no buildings, roads or other landmarks),
  • with passive sensors (using only a camera and GNSS; active sensors make the UGV discoverable),
  • with no prior map or with an outdated map,
  • with unreliable satellite positioning signals.

A system that satisfies the above goals was proposed in the ViKiNG paper by Dhruv Shah and Sergey Levine from the University of California, Berkeley. The paper demonstrated vision-based kilometer-scale navigation with geographical hints in semi-structured urban environments, including parks. The goal of this project was to extend the ViKiNG solution to unstructured off-road environments, for example forests.

Examples of the desired environment:

[Images: forest1, forest2, forest3]

Activities and results of demonstration project

Challenge addressed

The goal of using passive sensors means that the camera is the primary sensor. Currently the best known way to make sense of camera images is to use artificial neural networks. These networks need a lot of training data to work well. Therefore, the main goal of this project was to collect and validate the data needed to train artificial neural networks for vision-based navigation.

We set ourselves a goal of collecting 50 hours of data consisting of 150 km of trajectories. This was inspired by the ViKiNG paper, which had 42 hours of training data. The time goal was achieved; distance-wise, 104 km was collected.

In addition to collecting the data, we wanted to validate whether it is usable for training the neural networks. We went further than that by not only training the networks, but also implementing a proof-of-concept navigation system on a Jackal robot.

Jackal UGV

Data sources

The data was collected between April 12th and October 6th, 2023, at 27 orienteering events and 20 self-guided sessions around Tartu, Estonia. Details of the places and weather conditions can be found in this table.

Data collection was performed with a golf trolley fitted with the following sensors:

[Images: trolley1, trolley2, trolley3, trolley4]

Four different types of data were collected:

  1. camera images,
  2. visual odometry (trajectories derived from camera movement),
  3. GPS trajectories,
  4. georeferenced maps.

The following types of maps were acquired and georeferenced:

| Map type | Example image |
|---|---|
| Orienteering maps (usually from organizers, sometimes from Estonian O-Map) | orienteering map |
| Estonian base map (from Estonian Land Board) | Estonian base map |
| Estonian base map with elevation (from Estonian Land Board) | Estonian base map with elevation |
| Estonian orthophoto (from Estonian Land Board) | Estonian orthophoto |
| Google satellite photo (from Google Maps Static API) | Google satellite photo |
| Google road map (from Google Maps Static API) | Google road map |
| Google hybrid map (from Google Maps Static API) | Google hybrid map |

Further cleaning was applied to the data, with the following sections removed:

  • Missing odometry data
  • Big change in position: >1.0m per timestep
  • Low velocity: <0.05 m/s
  • High velocity: >2.5 m/s
  • Model prediction errors were analyzed
  • Bad trajectories
  • Missing or bad camera images

Altogether this resulted in 94.4 km of trajectories used for training.
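
The threshold-based filters in the list above (missing odometry, position jumps, low and high velocity) can be sketched as follows. This is a minimal illustration, not the project's actual cleaning code: the function name `valid_step_mask`, the array layout, and the use of NaN to mark missing odometry are assumptions; only the three threshold values come from the list.

```python
import numpy as np

# Thresholds taken from the cleaning list above.
MAX_STEP_M = 1.0       # big change in position per timestep
MIN_SPEED_MPS = 0.05   # low velocity
MAX_SPEED_MPS = 2.5    # high velocity


def valid_step_mask(positions, timestamps):
    """Return a boolean mask over the N-1 steps that pass all filters.

    positions: (N, 2) odometry x, y in metres; NaN marks missing odometry
               (an assumed convention for this sketch).
    timestamps: (N,) times in seconds.
    """
    positions = np.asarray(positions, dtype=float)
    timestamps = np.asarray(timestamps, dtype=float)
    step = np.linalg.norm(np.diff(positions, axis=0), axis=1)
    dt = np.diff(timestamps)
    with np.errstate(invalid="ignore", divide="ignore"):
        speed = step / dt
        ok = (
            ~np.isnan(step)
            & (step <= MAX_STEP_M)
            & (speed >= MIN_SPEED_MPS)
            & (speed <= MAX_SPEED_MPS)
        )
    return ok
```

Steps where the mask is False would be cut out, splitting a session into shorter valid segments. The remaining criteria in the list (bad trajectories, bad camera images, model prediction errors) need manual or model-based review and are not covered here.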

In addition, the dataset for the local planner was combined with the RECON dataset of 40 hours of autonomously collected trajectories.

Description of AI technology

The system makes use of two neural networks: a local planner and a global planner.

The local planner takes a camera image and predicts the next waypoints, where the robot can drive without hitting obstacles.

Inputs to the model:

  • Current camera image
  • Past 5 camera images for context
  • Goal image

Outputs of the model:

  • Trajectory of 5 waypoints
  • Temporal distance to the goal

The local planner is trained using camera images and visual odometry. The goal image is an image taken a fixed number of timesteps into the future. The temporal distance to the goal is the number of timesteps between the current image and the goal image.
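
As a rough illustration of how one training sample could be assembled from a recorded session, the sketch below picks the context frames, the goal frame and the waypoint labels for a given timestep. The function `make_sample`, the `(x, y, yaw)` pose layout and the `goal_offset` parameter are assumptions made for this example; only the counts (5 context images, 5 waypoints) come from the description above.

```python
import numpy as np

CONTEXT = 5        # past camera images used as context
NUM_WAYPOINTS = 5  # waypoints in the predicted trajectory


def make_sample(poses, t, goal_offset):
    """Build frame indices and labels for one training sample at timestep t.

    poses: (N, 3) visual-odometry poses as (x, y, yaw), an assumed layout.
    goal_offset: hypothetical number of timesteps between observation and goal.
    """
    poses = np.asarray(poses, dtype=float)
    if t - CONTEXT < 0 or t + max(goal_offset, NUM_WAYPOINTS) >= len(poses):
        raise IndexError("not enough frames around t")
    x, y, yaw = poses[t]
    future = poses[t + 1 : t + 1 + NUM_WAYPOINTS, :2] - np.array([x, y])
    # Rotate world-frame offsets into the robot frame (x forward, y left).
    c, s = np.cos(yaw), np.sin(yaw)
    world_to_robot = np.array([[c, s], [-s, c]])
    waypoints = future @ world_to_robot.T
    return {
        "context_idx": list(range(t - CONTEXT, t)),
        "obs_idx": t,
        "goal_idx": t + goal_offset,
        "waypoints": waypoints,
        "temporal_distance": goal_offset,
    }
```

Expressing the waypoints in the robot frame makes the labels independent of where in the world the session was recorded.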

Local planner

The global planner takes the waypoints proposed by the local planner and estimates which of them are likely to be on the path to the final goal.

Inputs to the model:

  • Overhead map
  • Current location
  • Goal location

Outputs of the model:

  • Probabilities whether each map pixel is on the path from current location to goal

The global planner is trained using georeferenced maps and GPS trajectories: given two points on a trajectory, all points in between are marked as high-probability points.
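
The labelling rule described above can be sketched as a mask over map pixels. This is an assumed, simplified version: the function `path_label_mask`, the `(row, col)` pixel layout and the `radius` parameter used to thicken the path are illustrative and not taken from the project code.

```python
import numpy as np


def path_label_mask(traj_px, start, end, shape, radius=1):
    """Mark map pixels visited between two trajectory indices.

    traj_px: (N, 2) integer (row, col) map pixels along a GPS trajectory.
    start, end: trajectory indices of the two chosen points (start < end).
    shape: (height, width) of the map image.
    radius: hypothetical half-width in pixels used to thicken the path.
    """
    h, w = shape
    mask = np.zeros(shape, dtype=np.float32)
    for r, c in np.asarray(traj_px)[start : end + 1]:
        if not (0 <= r < h and 0 <= c < w):
            continue  # trajectory point falls outside the map crop
        r0, r1 = max(r - radius, 0), min(r + radius + 1, h)
        c0, c1 = max(c - radius, 0), min(c + radius + 1, w)
        mask[r0:r1, c0:c1] = 1.0
    return mask
```

The resulting mask would serve as the target for the per-pixel probabilities, with the map and the two endpoint locations as inputs.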

Global planner

These two models work in coordination to handle outdated maps and inaccurate GPS:

  • as long as the local planner proposes valid waypoints the robot never collides with obstacles,
  • as the global planner picks waypoints which are on the path to the final destination, it tends to move towards the final goal, even if the GPS positioning is wrong or the map is outdated.
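
A minimal sketch of how the two outputs might be combined: candidate waypoints from the local planner are transformed from the robot frame into map pixels and scored with the global planner's probability map. The function `pick_waypoint`, the north-up map convention and the parameters `origin_xy` and `m_per_px` are assumptions for illustration only.

```python
import numpy as np


def pick_waypoint(candidates, robot_xy, robot_yaw, prob_map, origin_xy, m_per_px):
    """Score candidate waypoints with the probability map and pick the best.

    candidates: (K, 2) robot-frame waypoints in metres (x forward, y left).
    robot_xy, robot_yaw: robot pose in map coordinates (metres, radians).
    prob_map: (H, W) path probabilities; row 0 is the northern edge (assumed).
    origin_xy: map coordinates of the top-left corner; m_per_px: resolution.
    """
    c, s = np.cos(robot_yaw), np.sin(robot_yaw)
    robot_to_world = np.array([[c, -s], [s, c]])
    world = np.asarray(candidates, dtype=float) @ robot_to_world.T
    world = world + np.asarray(robot_xy, dtype=float)
    cols = np.floor((world[:, 0] - origin_xy[0]) / m_per_px).astype(int)
    rows = np.floor((origin_xy[1] - world[:, 1]) / m_per_px).astype(int)
    h, w = prob_map.shape
    inside = (rows >= 0) & (rows < h) & (cols >= 0) & (cols < w)
    # Waypoints that fall outside the map get the lowest possible score.
    scores = np.full(len(world), -np.inf)
    scores[inside] = prob_map[rows[inside], cols[inside]]
    return int(np.argmax(scores)), scores
```

Because every candidate comes from the local planner, a wrong GPS fix or an outdated map can only change which of the drivable waypoints is chosen, not whether the chosen waypoint is drivable.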

Results of validation

Local planner

For the local planner, the following network architectures were considered:

| Model | Pretrained weights | Trained or finetuned | On-policy tested | Generative | Waypoint proposal method |
|---|---|---|---|---|---|
| VAE | - | + | + | + | Sampling from latent representation |
| GNM | + | + | + | - | Cropping the current observation |
| ViNT | + | - | + | + | Goal image diffusion |
| NoMaD | + | - | - | + | Trajectory diffusion |

The VAE model was trained from scratch; all other models were used with pre-trained weights from the Berkeley group. The GNM model was additionally fine-tuned on our own dataset.

The models were tested both off-policy and on-policy. Off-policy means that the model was applied to recorded data, the model's predicted actions were just visualized, but not actuated. On-policy means that the model’s predicted actions were actually actuated on the robot.

For on-policy testing we recorded a fixed route, took goal images at fixed intervals and measured the success rate in navigating to every goal image along the route. This shows how well the model understands the direction of the goal image and how well it can detect that the goal was reached. The operator intervened when the robot was going completely off the path and guided it back to the track. Sometimes the robot failed to detect the goal, but was driving in the right direction and successfully recognized the subsequent goal. In that case the goal was not marked as achieved, but no intervention was necessary.
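
The bookkeeping described above can be sketched as a small state machine that replays the model's predicted temporal distances frame by frame. Everything here is an assumption made for illustration: the function `track_goals`, the per-frame dictionary of distances, the single `threshold`, and the rule of looking one goal ahead.

```python
def track_goals(frames, num_goals, threshold):
    """Replay predicted distances and record which goals were achieved.

    frames: iterable of dicts mapping goal index -> predicted temporal
            distance for that camera frame (assumed input format).
    threshold: hypothetical "goal achieved" cut-off in timesteps.
    Returns a list of booleans, one per goal.
    """
    achieved = [False] * num_goals
    current = 0
    for distances in frames:
        if current >= num_goals:
            break
        if distances.get(current, float("inf")) <= threshold:
            achieved[current] = True
            current += 1
        elif (
            current + 1 < num_goals
            and distances.get(current + 1, float("inf")) <= threshold
        ):
            # The current goal was missed, but the next one was recognized:
            # the missed goal stays unachieved and no intervention is counted.
            achieved[current + 1] = True
            current += 2
    return achieved
```

The success rate is then the share of True entries, while interventions are counted separately by the operator.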

Off-policy results

The videos below show the models applied to pre-recorded data. In the videos, the green trajectory represents the ground truth, the red trajectory represents the goal-conditioned predicted trajectory (many in the case of NoMaD), and blue represents sampled possible trajectories (in the case of VAE).

| Model | Video |
|---|---|
| VAE | VAE |
| GNM finetuned | GNM finetuned |
| ViNT | ViNT |
| NoMaD with goal images at fixed intervals | NoMaD goal |
| NoMaD with one fixed goal (exploratory mode) | NoMaD explore |
| NoMaD orienteering | NoMaD orienteering |

Comments:

  • VAE prefers going straight, probably because the training dataset was too homogeneous. GNM and ViNT turn slightly less than the ground truth trajectory, but that is not necessarily a problem when running the models on-policy. NoMaD seems to turn the most.
  • GNM and VAE are trained with time-interval trajectories that shorten close to goal or obstacle. ViNT and NoMaD seem to be trained with distance-interval trajectories that do not shorten. Distance prediction shortens with all models near the goal.
  • VAE and NoMaD can directly produce multiple candidate trajectories. VAE trajectories are only conditioned on observation, NoMaD trajectories are additionally conditioned on goal. NoMaD trajectories show some multi-modal behavior (passing the tree from both sides).
  • For GNM and ViNT the only way to generate multiple trajectories is to use different goal images. The image diffusion approach used in the ViNT paper seemed overkill to us, so we experimented instead with using crops of the observation images as goals. Some examples can be seen in the Putting it all together section.
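
The cropping idea from the last comment can be sketched as follows: several horizontally shifted crops of the current observation are produced, and each crop would then be fed to the model as a goal image to obtain one candidate trajectory. The function `crop_goals` and its parameters `num_crops` and `crop_frac` are hypothetical; the actual crop layout used in the project is not stated here.

```python
import numpy as np


def crop_goals(image, num_crops=3, crop_frac=0.6):
    """Return horizontally shifted crops of an (H, W, C) camera image.

    num_crops and crop_frac are illustrative values, not project settings.
    Crops are vertically centred and spread evenly from left to right.
    """
    h, w = image.shape[:2]
    ch, cw = int(h * crop_frac), int(w * crop_frac)
    top = (h - ch) // 2
    lefts = np.linspace(0, w - cw, num_crops).astype(int)
    return [image[top : top + ch, left : left + cw] for left in lefts]
```

In practice each crop would still need to be resized to the model's expected input resolution before being used as a goal image.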

On-policy results indoors

We recorded a fixed route in the Delta office with goal images every 1 or 2 meters and measured the goal success rate for each interval.

| Model | Goal interval | Number of goal images | Number of interventions | Success rate (%) | Video |
|---|---|---|---|---|---|
| GNM | 1m | 30 | 0 | 90.00 | video |
| GNM finetuned | 1m | 30 | 1 | 93.33 | video |
| ViNT | 1m | 30 | 2 | 96.67 | video |
| GNM | 2m | 15 | 0 | 86.67 | video |
| GNM finetuned | 2m | 15 | 0 | 93.33 | video |
| ViNT | 2m | 15 | 0 | 93.33 | video |
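
The success rate is the share of goal images that the robot recognized as reached. The short sketch below reproduces the percentages in the table. The numbers of achieved goals (27, 28, 29 of 30 and 13, 14 of 15) are not reported in the source; they are inferred here from the percentages and should be read as an assumption.

```python
def success_rate(goals_achieved, num_goals):
    """Percentage of goal images reached, rounded to two decimals."""
    return round(100.0 * goals_achieved / num_goals, 2)


# Achieved counts below are inferred from the table, not reported values.
indoor_1m = {"GNM": 27, "GNM finetuned": 28, "ViNT": 29}
for model, achieved in indoor_1m.items():
    print(model, success_rate(achieved, 30))
```

With so few goal images per run, one additional missed goal changes the rate by 3.33 points at 1m intervals and 6.67 points at 2m intervals, so small differences between models should be interpreted with care.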

Comments:

  • VAE cannot be used indoors, because its training data did not include indoor scenes.
  • NoMaD integration with the robot was not finished by the time of testing.
  • Using intervals longer than 2m indoors was pointless: because of sharp turns, the goal would not be within line of sight.
  • GNM finetuned did better than vanilla GNM, possibly because the finetuning dataset was recorded with the same camera that was used during testing.

Example video of top-performing model (GNM finetuned) at 4X speed:

GNM finetuned indoors

On-policy results outdoors

We recorded a fixed route in Delta park with goal images every 2, 5 or 10 meters and measured the goal success rate for each interval.

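| Model | Goal interval | Number of goal images | Number of interventions | Success rate (%) | Video |
|---|---|---|---|---|---|
| GNM | 2m | 38 | 1 | 86.84 | video |
| GNM finetuned | 2m | 38 | 0 | 81.58 | video |
| GNM finetuned | 5m | 17 | 7 | 100 | video |
| ViNT | 5m | 17 | 7 | 100 | video |
| ViNT | 10m | 8 | 9 | 100 | video |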
Comments:

  • Vanilla GNM goal recognition was slightly more reliable than that of GNM finetuned, but GNM finetuned was more reliable at staying on the track.
  • While the models achieved a 100% success rate at 5m intervals, this came at the expense of 7 interventions: for nearly half of the 17 goals the operator had to help the robot reach the goal.
  • At 10m intervals ViNT, the only model tested, was basically useless: it had more interventions (9) than goals (8), so the operator had to assist the robot for every goal.
  • Goal distances predicted outdoors seemed to be longer in general than distances predicted indoors, i.e. we had to use different "goal achieved" thresholds indoors and outdoors.
  • NoMaD integration with the robot was not finished by the time of testing.
  • VAE did not perform well enough to be included in the table. It tended to go straight all the time.

Example video of top-performing model (GNM finetuned) at 4X speed:

GNM finetuned outdoors

Summary of local planner results

| Model | Goal following | Turning | Obstacle avoidance | Trail following | Trajectory diversity | Trajectory multi-modality |
|---|---|---|---|---|---|---|
| VAE | limited | poor | limited | good | limited | poor |
| GNM finetuned | good | limited | good | limited | - | - |
| ViNT | good | limited | limited | limited | - | - |
| NoMaD | good | good | good | good | good | limited |

Global planner

For the global planner, the following network architectures were considered:

As the U-Net approach worked much better, the contrastive approach was abandoned. Most of the experimentation was done with the base map with elevation.

The following videos show an on-policy simulation where the robot proposes a number of random waypoints and then moves towards the one that has the highest probability. The blue dot shows the robot's current location and the yellow dot is the goal location.

| Location | Video |
|---|---|
| Ihaste | Ihaste |
| Kärgandi, sticking to the road | Kärgandi |
| Annelinn, avoidance of houses | Annelinn |

The following videos show different behavior for different map modalities.

| Map modality and behavior | Video |
|---|---|
| Base map - sticking to the road | Base map |
| Road map - going straight (not enough context) | Road map |
| Orthophoto - mostly sticking to the road | Orthophoto |

Putting it all together

The following video shows an off-policy evaluation of the whole system on a recorded session. The colored trajectories are produced with crops of the original camera image used as goals, as shown in the video. The white trajectory comes from the final goal.

Delta park off-policy final

On-policy evaluation of the whole system was not possible because of technical difficulties with the GNSS sensor and because the arrival of winter made using the models pointless, as they were mainly trained on summer data.

Technical architecture

For the local planner, the following network architectures were tried:

For the global planner, the following network architectures were tried:

Potential areas of use

The working solution could be used in any area that needs navigation in an unstructured environment with a poor GPS signal and outdated maps, for example:

  • military,
  • agriculture,
  • forestry,
  • rescue.

The dataset collected in this project can also be used to create a visual navigation benchmark and an international robot orienteering competition. Such a competition would make novel solutions and international talent accessible to Milrem Robotics.

Lessons learned

For training the local planner, the dataset seemed insufficient or contained trajectories that were too simple (moving mostly forward). Even after combining our data with the RECON dataset or fine-tuning existing models, the results were inconclusive: sometimes the fine-tuned model performed better and sometimes worse than the original. The original general navigation models were also unreliable and were not always able to avoid obstacles. More work is needed to make visual navigation reliable.

Alternative model outputs could be considered, e.g. predicting free space instead of trajectories and proposing waypoints from that free space. Collecting more exploratory data directly with the robot might also be necessary: the ViKiNG paper used mainly automatically collected exploratory data (30 hours) and relatively few expert trajectories (12 hours). In our case all of the data consisted of expert trajectories.

The global planner trained much better and was able to estimate the recommended path between two points reasonably well. We also observed different behavior for different map modalities, e.g. base map and road map. More work is needed to reduce the artifacts produced by the fully convolutional network, and some map modalities might need further tuning.

Final takeaways:

  • Training neural networks in 2023 is still hard.
  • Dataset curation is non-trivial and less documented than model training.
  • Use (or fine-tune) pre-trained models whenever they are available.
  • Off-policy performance (on recordings) does not match on-policy performance (on robot).

Description of User Interface

Delta park off-policy final

  • The screen shows the current camera image and the proposed trajectories. The white trajectory is the one induced by the goal image at the top right.
  • The bottom right shows the probability map (the path from the current position to the goal) and the original map. Waypoint colors match the trajectory colors.
  • The left pane shows the robot command.
