Have you ever dreamed of a smart pet that could be trained with modern, data-driven approaches?
⚠️ We are waiting for the hardware to be built so that the visual encoders can be fit on real-time observation data (RGB, depth, etc.). ⚠️
Prototype: Smart Pet Necklace for Command-based Navigation
The core purpose of this project is to build a Software 2.0-style model for a smart pet necklace that helps my pet navigate indoor environments based on my voice commands, delivered through a vibration motor in the necklace.
- PointNet++ for 3D object detection and scene segmentation
- AI2THOR for indoor data generation
- VLN-CE model for Vision-and-Language Navigation in Continuous Environments
- Command-based navigation system using [Matterport3D](https://niessner.github.io/Matterport/#download) for Vision-and-Language Navigation in Continuous Environments (VLN-CE)
- Improvements specific to VLN-CE:
  - Using a custom environment for VLN-CE that implements the Gymnasium interface; it provides functionality similar to Habitat with a simpler implementation (see the sketch after this list).
  - Replacing simple cross-attention with cross-modal transformers like those in ViLBERT.
  - Using ViT for visual encoding to better capture object-level semantics.
  - Using BERT, RoBERTa, or DistilBERT for language understanding (DistilBERT is implemented, but you can change the config files based on your preferences).
  - Integrating explicit depth fusion strategies into the visual encoders; the fusion strategy can be specified in the config file for each encoder.
  - Integrating Language-Aligned Waypoint (LAW) supervision into the VLN-CE model and tuning the DataLoader to split the dataset for better LAW supervision.
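As referenced in the custom-environment item above, here is a minimal sketch of what a Gymnasium-style VLN-CE environment could look like. The observation keys, episode format, and reward shaping are illustrative assumptions rather than the repository's actual interface.

```python
# Minimal sketch of a Gymnasium-style VLN-CE environment.
# Observation keys, episode format, and reward shaping are illustrative
# assumptions, not the repository's actual interface.
import numpy as np
import gymnasium as gym
from gymnasium import spaces


class VLNCEEnv(gym.Env):
    """Steps through a pre-recorded VLN-CE episode (instruction + path)."""

    ACTIONS = ["STOP", "MOVE_FORWARD", "TURN_LEFT", "TURN_RIGHT"]

    def __init__(self, episodes, image_size=(224, 224)):
        super().__init__()
        self.episodes = episodes  # list of dicts: {"instruction": str, "path": [...]}
        self.image_size = image_size
        self.action_space = spaces.Discrete(len(self.ACTIONS))
        self.observation_space = spaces.Dict({
            "rgb": spaces.Box(0, 255, (*image_size, 3), dtype=np.uint8),
            "depth": spaces.Box(0.0, 10.0, (*image_size, 1), dtype=np.float32),
            "instruction": spaces.Text(max_length=512),
        })

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        self.episode = self.episodes[self.np_random.integers(len(self.episodes))]
        self.step_idx = 0
        return self._observation(), {}

    def step(self, action):
        self.step_idx += 1
        terminated = self.ACTIONS[action] == "STOP"
        truncated = self.step_idx >= len(self.episode["path"])
        # Toy reward: +1 only when the agent stops at the end of the path.
        reward = 1.0 if terminated and truncated else 0.0
        return self._observation(), reward, terminated, truncated, {}

    def _observation(self):
        # Dummy frames stand in for simulator renders until real images
        # are available (see the to-do item on dummy images below).
        return {
            "rgb": np.zeros((*self.image_size, 3), dtype=np.uint8),
            "depth": np.zeros((*self.image_size, 1), dtype=np.float32),
            "instruction": self.episode["instruction"],
        }
```

Keeping the interface to plain `reset`/`step` calls lets the rest of the training loop stay simulator-agnostic until Habitat-Lab is swapped in.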
- Generate physics-based data using AI2THOR and integrate physics-informed models (see the AI2THOR sketch after this list).
- Implement the VLN-CE model using Habitat-Lab instead of Gymnasium.
- Train on the ScaleVLN dataset for better generalization (needs a GPU 🥲).
- Provide actual images for training and testing. For now, the DataLoader feeds the visual encoder dummy images due to the lack of real images, which makes the visual encoder output NaN logits because of the non-uniform observations.
- Specific to VLN-CE:
  - Integrate Language-Aligned Waypoint (LAW) Supervision for Vision-and-Language Navigation in Continuous Environments into the VLN-CE model.
  - Integrate EnvEdit (Environment Editing for Vision-and-Language Navigation) into the VLN-CE model for data augmentation.
  - Integrate VLN-PETL (Parameter-Efficient Transfer Learning for Vision-and-Language Navigation) to reduce computational costs.
  - Integrate Dynam3D (Dynamic Layered 3D Tokens Empower VLM for Vision-and-Language Navigation).
  - Use the navigation graph, or simulate the agent's pose step by step, in LAW supervision (a waypoint-oracle sketch follows this list).
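As noted in the first item of the list above, here is a minimal sketch of how indoor RGB-D frames could be collected with AI2THOR's `Controller`. The scene name, action sequence, and output file are assumptions for illustration.

```python
# Sketch: collecting RGB-D observations from AI2THOR for indoor data generation.
# Scene choice, action sequence, and output path are illustrative assumptions.
import numpy as np
from ai2thor.controller import Controller

controller = Controller(
    scene="FloorPlan1",       # one of AI2THOR's kitchen scenes
    renderDepthImage=True,    # required for event.depth_frame
    width=224,
    height=224,
)

frames = []
for action in ["RotateRight", "MoveAhead", "RotateLeft", "MoveAhead"]:
    event = controller.step(action=action)
    frames.append({
        "rgb": event.frame,                               # (H, W, 3) uint8
        "depth": event.depth_frame,                       # (H, W) float32, metres
        "agent_pose": event.metadata["agent"]["position"],
    })

np.savez(
    "ai2thor_frames.npz",
    rgb=np.stack([f["rgb"] for f in frames]),
    depth=np.stack([f["depth"] for f in frames]),
)
controller.stop()
```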
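For the last sub-item above, here is a minimal sketch of how a LAW-style oracle action could be derived by simulating the agent's pose against the ground-truth waypoints. The heading convention, thresholds, and waypoint-selection rule are simplifying assumptions.

```python
# Sketch: LAW-style oracle action from the agent's simulated pose and the
# ground-truth waypoint path. Heading convention, thresholds, and the
# waypoint-selection rule are simplifying assumptions.
import numpy as np

def law_oracle_action(agent_pos, agent_heading, waypoints,
                      goal_radius=0.5, turn_threshold=np.radians(15)):
    """Return one of STOP / MOVE_FORWARD / TURN_LEFT / TURN_RIGHT."""
    # Language-aligned waypoint: here simply the closest waypoint on the
    # reference path (with a navigation graph, the target would come from
    # graph nodes instead).
    dists = np.linalg.norm(waypoints - agent_pos, axis=1)
    nearest = int(np.argmin(dists))
    target = waypoints[nearest]

    # Stop once the agent is within goal_radius of the final waypoint.
    if nearest == len(waypoints) - 1 and dists[nearest] < goal_radius:
        return "STOP"

    # Relative bearing from the agent's heading to the target waypoint.
    desired = np.arctan2(target[1] - agent_pos[1], target[0] - agent_pos[0])
    bearing = (desired - agent_heading + np.pi) % (2 * np.pi) - np.pi

    if bearing > turn_threshold:
        return "TURN_LEFT"
    if bearing < -turn_threshold:
        return "TURN_RIGHT"
    return "MOVE_FORWARD"

# Agent at the origin facing +x, two waypoints ahead of it.
print(law_oracle_action(np.array([0.0, 0.0]), 0.0,
                        np.array([[1.0, 0.0], [2.0, 1.0]])))  # MOVE_FORWARD
```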
- PointNet++ for 3D object detection and scene segmentation
- VLN-CE for Vision-and-Language Navigation in Continuous Environments
- Beyond the Nav-Graph: Vision-and-Language Navigation in Continuous Environments for VLN-CE
- AI2THOR for indoor data generation
- Habitat-Lab for indoor data generation
- Replacing the simple CrossModalAttention with ViLBERT and CrossModalTransformer modules improved generalization by 18.1%.
- Improving memory management and the optimization strategy, tuning the learning rate and initialization, and using pre-trained encoders such as DistilBERT, ViT, and MobileNet dropped the loss by 98% to 0.026 and improved performance over the previous stage and the basic structure, at the cost of more computation. The more efficient memory management also leads to overfitting, so data augmentation is needed; the next stage is tuning the dataset for a more diverse set of instructions and pathways.
- Integrating explicit depth fusion strategies into the visual encoders improved performance by 11.54% over the previous stage and dropped the loss to 0.023 (see the encoder-fusion sketch after this list).
- Integrating LAW supervision improved performance by 94.78% over the previous stage and dropped the loss to 0.0012; a GPU and Habitat are increasingly needed to train on full episodes and to obtain observations during training.
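As referenced above, here is a minimal sketch of how pre-trained DistilBERT/ViT encoders might be combined with a simple concatenation-based fusion of RGB and depth features. The projection sizes, the small depth CNN, and the fusion choice are assumptions; the repository selects encoders and fusion strategies through its config files.

```python
# Sketch: pre-trained DistilBERT / ViT encoders with a simple late fusion of
# RGB and depth features. Projection sizes, the depth CNN, and the
# concatenation-based fusion are illustrative assumptions; the repo selects
# encoders and fusion strategies via its config files.
import torch
import torch.nn as nn
from transformers import DistilBertModel, DistilBertTokenizerFast, ViTModel


class RGBDLateFusionEncoder(nn.Module):
    def __init__(self, hidden_dim=512):
        super().__init__()
        self.vit = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")
        self.bert = DistilBertModel.from_pretrained("distilbert-base-uncased")
        # Lightweight CNN for the single-channel depth map.
        self.depth_cnn = nn.Sequential(
            nn.Conv2d(1, 32, 7, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.visual_proj = nn.Linear(self.vit.config.hidden_size + 64, hidden_dim)
        self.lang_proj = nn.Linear(self.bert.config.dim, hidden_dim)  # 768 for DistilBERT

    def forward(self, rgb, depth, input_ids, attention_mask):
        rgb_feat = self.vit(pixel_values=rgb).last_hidden_state[:, 0]   # CLS token
        depth_feat = self.depth_cnn(depth)
        visual = self.visual_proj(torch.cat([rgb_feat, depth_feat], dim=-1))
        lang = self.lang_proj(
            self.bert(input_ids=input_ids,
                      attention_mask=attention_mask).last_hidden_state[:, 0])
        return visual, lang


# Example forward pass with random tensors and a tokenized instruction.
tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
tokens = tokenizer(["go to the kitchen and stop by the fridge"],
                   return_tensors="pt", padding=True)
model = RGBDLateFusionEncoder()
visual, lang = model(torch.rand(1, 3, 224, 224), torch.rand(1, 1, 224, 224),
                     tokens["input_ids"], tokens["attention_mask"])
print(visual.shape, lang.shape)  # torch.Size([1, 512]) torch.Size([1, 512])
```

Under this structure, swapping the concatenation for an attention-based fusion, or the depth CNN for a different depth encoder, stays a config-level change.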
- To start training PointNet++ for 3D object detection and scene segmentation on the AI2THOR dataset:

  ```bash
  python3 script/train.py --config configs/PointNetPP.json --model PointNetPP
  ```

- To start training VLN-CE for Vision-and-Language Navigation in Continuous Environments on the VLN-CE dataset:

  ```bash
  python3 script/VLN_CE/main.py
  ```