- Details:
- Partha P. Nath
- Machine Learning Engineer
- Vienna, AT
- Background:
- 3D Vision & ML (TU Munich).
- CS, Math, Electronics, Drones, Simulations.
- Core Mission: Solving the Hard Problems.
- Technical Pillars: 3D Human Tracking, Object Pose Estimation, Pointcloud Encoding
- Goal: Capture Long-Horizon Activities
- Provide full-fidelity 6D hand poses
- Multi-subject, multi-object, low fps, versus previous datasets: single hand/subject, single object, single action, high fps.
- Challenges: Standard epipolar matching of landmarks fails under inter-hand occlusion across views
- Result: Degenerate triangulation that destroys data quality.
- Innovation: Project ReID as clustering in 3D space, with or without temporal guidance
- Outcome:
- Automated multi-subject tracking.
- Reduced untrackable frames from 1088 → 22 (Internal) and 6302 → 213 (H2O Benchmark).
- Captured the most diverse hand pose dataset with just basic activity sequences.
- 1.1M frames of annotated 20 FPS RGB-D paired with aligned MANO hands
- Strongest inter-dataset HMR generalization
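A minimal sketch of treating re-identification as clustering in 3D, assuming per-frame 3D hand candidates (for example, triangulated wrist positions in metres) are already available. The function name, the `radius` threshold, and the greedy nearest-centroid rule are illustrative assumptions, not the method used in the project.

```python
import math

def _mean(pts):
    """Centroid of a list of (x, y, z) tuples."""
    n = len(pts)
    return tuple(sum(p[i] for p in pts) / n for i in range(3))

def assign_identities(candidates, prev_centroids=None, radius=0.15):
    """Group 3D candidate points into identities.

    candidates: list of (x, y, z) tuples for the current frame.
    prev_centroids: optional {track_id: (x, y, z)} from the previous frame;
        when given, it seeds the clusters (temporal guidance).
    Returns {track_id: centroid} for tracks observed in the current frame.
    """
    clusters = {tid: [c] for tid, c in (prev_centroids or {}).items()}
    seeded = set(clusters)
    next_id = max(clusters, default=-1) + 1

    for p in candidates:
        # Find the nearest cluster whose centroid lies within `radius`.
        best_id, best_d = None, radius
        for tid, pts in clusters.items():
            d = math.dist(p, _mean(pts))
            if d <= best_d:
                best_id, best_d = tid, d
        if best_id is None:
            clusters[next_id] = [p]  # no cluster close enough: new identity
            next_id += 1
        else:
            clusters[best_id].append(p)

    out = {}
    for tid, pts in clusters.items():
        # Drop the seed so the centroid reflects current observations only.
        current = pts[1:] if tid in seeded else pts
        if current:
            out[tid] = _mean(current)
    return out
```

Calling it without `prev_centroids` clusters each frame independently; passing the previous frame's output keeps track IDs stable across frames.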
- Training Challenges:
- Dependency, code, and orchestration variations across models, model-tasks, versions and GPU arch.
- Fragmented Cluster (A4000, A5000, A6000 nodes).
- GPU Efficacy
- Actions:
- Docker Containerization
- WandB for experiment tracking, grouping, artifact upload, and code-version checks
- PyTorch Lightning or other hyperparameter-sweep libraries for experiment jobs
- Fixing OSS codebases for challenges posed in later work and in ours
- Optimizing GPU bottlenecks with nvidia-smi, htop, wandb
- Standardizing DDP for all codebases
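One way to keep runs comparable across a fragmented cluster is to fix the per-GPU effective batch and vary only the micro-batch and accumulation steps. The sketch below assumes a hypothetical per-sample memory cost and reserved headroom; only the VRAM sizes of the three cards are facts.

```python
# VRAM of the GPU types on the cluster, in GB.
GPU_MEMORY_GB = {"A4000": 16, "A5000": 24, "A6000": 48}

def batch_plan(gpu, target_batch, gb_per_sample, reserved_gb=4.0):
    """Return (micro_batch, accumulation_steps) for one GPU.

    gb_per_sample and reserved_gb are assumed, model-specific estimates.
    micro_batch * accumulation_steps always equals target_batch.
    """
    usable = GPU_MEMORY_GB[gpu] - reserved_gb
    max_micro = max(1, int(usable // gb_per_sample))
    # Largest micro-batch that fits in memory and divides the target evenly.
    micro = next(m for m in range(min(max_micro, target_batch), 0, -1)
                 if target_batch % m == 0)
    return micro, target_batch // micro

for gpu in GPU_MEMORY_GB:
    print(gpu, batch_plan(gpu, target_batch=32, gb_per_sample=1.5))
```

The same config can then launch on any node type without changing the optimization behaviour, at the cost of more accumulation steps on smaller cards.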
- Context:
- Zero initial cloud resources.
- Building a LiDAR segmentation training pipeline from scratch
- Action:
- Physically built an A6000 on-prem node.
- Identified and selected Pointcept (Point Transformer series) as the baseline
- Adapted loaders for processed data, ran sample trainings, and confirmed metrics for the top-10 models using WandB
- Built the preprocessing workflow to ingest new raw LiDAR data
- Containerized and upgraded the training environment for the latest dependencies (CUDA-Torch-PyG-SpConv)
- Deployed an on-prem ClearML server to reduce experiment-tracking cost.
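A preprocessing workflow for raw LiDAR typically includes a voxelization step. The sketch below shows plain voxel-grid downsampling in pure Python, as an assumption about what such a step might look like; the actual pipeline and its voxel size are not described in the slides.

```python
from collections import defaultdict

def voxel_downsample(points, voxel_size):
    """Average all points that fall into the same voxel.

    points: iterable of (x, y, z) tuples; voxel_size: edge length in the
    same unit. Returns one centroid per occupied voxel, ordered by voxel index.
    """
    buckets = defaultdict(list)
    for x, y, z in points:
        key = (int(x // voxel_size), int(y // voxel_size), int(z // voxel_size))
        buckets[key].append((x, y, z))
    return [
        tuple(sum(axis) / len(pts) for axis in zip(*pts))
        for _, pts in sorted(buckets.items())
    ]
```

Sorting by voxel index makes the output independent of the input point order, which helps when comparing preprocessing runs.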
- Action:
- Automated sweeps for training recipes and model sizes to balance latency vs. accuracy.
- Model Zoo: Auto-loading tagged checkpoints from registry.
- Feedback Workflow: Users → Inference Team → Annotators+Trainers → Retraining.
- Integrated methods to combat immediately visible issues like class imbalances, slicing dimensions, early stopping
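The latency-versus-accuracy trade-off and the tagged model zoo can be illustrated with a small selection routine. This is a sketch under assumptions: the `Checkpoint` fields, the mIoU metric, and the in-memory registry are hypothetical stand-ins for whatever the real registry stores.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Checkpoint:
    tag: str           # registry tag, e.g. a model size or recipe name
    miou: float        # validation accuracy (assumed metric)
    latency_ms: float  # measured inference latency

def pick_checkpoint(registry, max_latency_ms):
    """Return the most accurate checkpoint that meets the latency budget."""
    eligible = [c for c in registry if c.latency_ms <= max_latency_ms]
    if not eligible:
        raise LookupError("no checkpoint meets the latency budget")
    return max(eligible, key=lambda c: c.miou)

# Hypothetical registry contents, for illustration only.
zoo = [
    Checkpoint("small", miou=0.71, latency_ms=40.0),
    Checkpoint("base", miou=0.76, latency_ms=85.0),
    Checkpoint("large", miou=0.79, latency_ms=210.0),
]
print(pick_checkpoint(zoo, max_latency_ms=100.0).tag)
```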
- Action:
- Retriggering for new data distributions.
- Precalculating semantic re-weighting based on instance tracking.
- Fixing the Inference Stack for Docker/PyPA outages.
- Outcome: Self-sustaining internal model zoo requiring minimal manual intervention.
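Precalculated class re-weighting can be sketched with median-frequency balancing, a common scheme chosen here for illustration; the slides do not say which weighting was used. The per-class counts are assumed to come from tracked instances.

```python
from statistics import median

def median_frequency_weights(class_counts):
    """Compute weight_c = median(freq) / freq_c for each observed class.

    class_counts: {class_name: count}, e.g. points or tracked instances.
    Rare classes get weights above 1, dominant classes below 1.
    """
    total = sum(class_counts.values())
    freqs = {c: n / total for c, n in class_counts.items() if n > 0}
    med = median(freqs.values())
    return {c: med / f for c, f in freqs.items()}
```

The resulting dictionary can be computed once per data distribution and passed to a weighted loss, so retraining does not need to rescan the dataset.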
- Context:
- Transitioned to GCP with steady cloud credits
- Segmentation training highly optimized
- Multi-node DDP, gradient accumulation, standalone ZeRO optimizers (Stage 1 & 2)
- ClearML hosted globally in GCP, tracking commit ID + diff on every experiment
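Gradient accumulation works because scaled micro-batch gradients sum to the full-batch gradient. The toy below checks this for a one-parameter linear model in pure Python; it is an illustration of the arithmetic, not the training code.

```python
def grad_mse(w, batch):
    """Gradient of the mean squared error of y_hat = w * x over a batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def accumulated_grad(w, data, micro_batch):
    """Accumulate micro-batch gradients so the sum equals the full-batch gradient."""
    chunks = [data[i:i + micro_batch] for i in range(0, len(data), micro_batch)]
    acc = 0.0
    for chunk in chunks:
        # Scale by the chunk's share of samples; for equal-sized chunks this
        # is the usual "divide the loss by accumulation_steps".
        acc += grad_mse(w, chunk) * len(chunk) / len(data)
    return acc

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 7.0), (4.0, 8.0)]
print(accumulated_grad(1.5, data, micro_batch=2), grad_mse(1.5, data))
```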
- Subject:
- SpatialLM and SceneScript showcased promising results using structured language.
- Trained on internal (Meta) and synthetic datasets
- No released training code, and no tuning results for real data.
- Goals:
- Replicate SpatialLM public results
- Consolidate annotations from public and internal datasets
- Build the optimal VLM for our specific use case.
- Action:
- Integration into Pointcept
- SceneScript encoder, Llama and Qwen language models
- Basic embedding patching and vocab-resizing utility
- Data Conversion and Layout Annotation Processing for TBs of Public data
- VLM build and training code with initial tests for Pointcept encoders
- Identified and fixed critical issues in sorting determinism, quantization, & token limits
- Result: Fixed and retrained, with far better results mirroring the v1 public release
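The three issue classes named above (sorting determinism, quantization, token limits) can be shown on a toy layout serializer. This is a sketch under assumptions: the `<wall>` token, the coordinate range, the bin count, and the token budget are hypothetical and do not reproduce SpatialLM's actual format.

```python
def quantize(value, lo, hi, bins):
    """Map a coordinate in [lo, hi] to an integer bin in [0, bins - 1]."""
    clipped = min(max(value, lo), hi)
    return min(int((clipped - lo) / (hi - lo) * bins), bins - 1)

def serialize_layout(walls, lo=0.0, hi=32.0, bins=1280, max_tokens=64):
    """Turn walls, given as (x0, y0, x1, y1) in metres, into string tokens."""
    rows = [tuple(quantize(v, lo, hi, bins) for v in w) for w in walls]
    # Sort on the quantized values so input order never changes the output.
    rows.sort()
    tokens = []
    for row in rows:
        entity = ["<wall>"] + [str(v) for v in row]
        if len(tokens) + len(entity) > max_tokens:
            break  # drop whole entities rather than truncating one mid-way
        tokens.extend(entity)
    return tokens
```

Clamping the top edge into the last bin avoids an out-of-vocabulary index at `value == hi`, and cutting at entity boundaries keeps every emitted sequence parseable.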
- Action:
- Processed and integrated THOR, CV4AEC, Internal Datasets with S3 Streaming + Local Caching + Process Shard Id
- Extended the necessary Pointcept augmentations to handle layout data
- Iteratively trained and rebalanced
- Identified codebase chokepoints in distributed training
- Unified training under an extended HF Trainer
- FSDP, checkpointing, optimized data workers, and multiple experiment trackers
- Set up a DeepSpeed launcher script handling environment and source-code forwarding
- Result: Experiments became purely config-driven abstractions at massive scale
- Validation: The SpatialLM authors' second update mirrors our architecture.
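The combination of streaming, local caching, and a process shard ID can be sketched as follows. This is a minimal illustration with assumed names: `download` is a hypothetical callable (for example, a thin wrapper around an S3 client), and the shard keys and cache layout are made up.

```python
import hashlib
from pathlib import Path

def shards_for_process(shard_keys, rank, world_size, worker_id=0, num_workers=1):
    """Give every (rank, dataloader worker) pair a disjoint slice of shards."""
    process_shard_id = rank * num_workers + worker_id
    total = world_size * num_workers
    return sorted(shard_keys)[process_shard_id::total]

def cache_path(cache_dir, shard_key):
    """Stable local filename derived from the remote shard key."""
    digest = hashlib.sha256(shard_key.encode()).hexdigest()[:16]
    return Path(cache_dir) / f"{digest}.shard"

def open_shard(shard_key, cache_dir, download):
    """Return the local path, calling download(key, path) only on a cache miss."""
    path = cache_path(cache_dir, shard_key)
    if not path.exists():
        path.parent.mkdir(parents=True, exist_ok=True)
        tmp = path.with_suffix(".tmp")
        download(shard_key, tmp)
        tmp.replace(path)  # rename last so readers never see a partial file
    return path
```

Because the slice depends only on rank and worker index, no two processes fetch the same shard within an epoch, and repeated epochs read from local disk.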
- Action:
- Investigated Ray Train / Scale AI to unify execution.
- Why:
- Solving "Crash Anxiety" during week-long training runs.
- Native integration with Grafana dashboards
- Identical in abstraction to DeepSpeed launching
- Feature: Automatic GPU provisioning + Node Crash Recovery (Restart & Checkpoint management).
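The restart-and-checkpoint behaviour can be illustrated without any framework. The sketch below is not Ray Train's API: JSON files stand in for model checkpoints, a `RuntimeError` stands in for a node failure, and all names are hypothetical.

```python
import json
from pathlib import Path

def latest_checkpoint(ckpt_dir):
    """Return (step, path) of the newest checkpoint, or (0, None) if none exist."""
    ckpts = sorted(Path(ckpt_dir).glob("step_*.json"),
                   key=lambda p: int(p.stem.split("_")[1]))
    if not ckpts:
        return 0, None
    last = ckpts[-1]
    return int(last.stem.split("_")[1]), last

def run_with_recovery(train_step, ckpt_dir, total_steps, save_every=2, max_restarts=3):
    """Run train_step(step, state) -> state, resuming from disk after a crash."""
    Path(ckpt_dir).mkdir(parents=True, exist_ok=True)
    for _ in range(max_restarts + 1):
        # Every (re)start begins from the newest checkpoint on disk.
        step, path = latest_checkpoint(ckpt_dir)
        state = json.loads(path.read_text()) if path else {"loss": None}
        try:
            while step < total_steps:
                state = train_step(step, state)
                step += 1
                if step % save_every == 0:
                    out = Path(ckpt_dir) / f"step_{step}.json"
                    out.write_text(json.dumps(state))
            return state
        except RuntimeError:
            continue  # simulated node failure: loop back and resume
    raise RuntimeError("exceeded max_restarts")
```

Work done since the last checkpoint is repeated after a crash, so `save_every` trades checkpoint I/O against the amount of recomputation.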
- Compute: GCP, AWS, On-prem (A6000/A5000), Multi-node Clusters.
- Orchestration: Docker, DeepSpeed, Ray Train (experimental).
- Training Frameworks: PyTorch Lightning, HF Trainer, FSDP, DDP.
- Experiment Tracking: WandB, ClearML (Self-hosted & Cloud).
- Data Ops: S3 Streaming, Local Caching, Voxelization Pipelines.