3D Perception & MLOps Infrastructure

About Me

Bridging 3D Perception & MLOps Infrastructure

Details:
- Partha P. Nath
- Machine Learning Engineer
- Vienna, AT
Background:
- 3D Vision & ML (TU Munich).
- CS, Math, Electronics, Drones, Simulations.
Core Mission: Solving the Hard Problems.
Technical Pillars: 3D Human Tracking, Object Pose Estimation, Pointcloud Encoding

1 / 13

ADL4D — Challenge & Scope

CV and Geometry Challenge

Goal: Capture Long-Horizon Activities
- Provide full fidelity 6D hand poses
- Multi-subject, multi-object capture at low fps
- Previous datasets: single hand/subject, single object, single action, high fps.
Challenges: Standard epipolar matching of landmarks fails under inter-hand occlusion across views.
Result: Degenerate triangulation that destroys data quality.

2 / 13

ADL4D — Innovation & Impact

ReID in 3D Space, unlocking pose capture

Innovation: Recast ReID as clustering in 3D space, with or without temporal guidance
Outcome:
- Automated multi-subject tracking.
- Reduced untrackable frames from 1088 → 22 (Internal) and 6302 → 213 (H2O Benchmark).
- Captured the most diverse hand pose dataset with just basic activity sequences.
- 1.1M frames of annotated RGB-D at 20 FPS, paired with aligned MANO hands
- Strongest inter-dataset HMR generalization
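
The ReID-as-clustering idea above can be illustrated with a minimal sketch. It is not the ADL4D implementation: the function name, the greedy threshold rule, the 0.15 radius and the assumption that inputs are triangulated 3D points in metres are all illustrative choices. Passing the previous frame's centroids stands in for temporal guidance; omitting them clusters from scratch.

```python
import math


def assign_identities(points, prev_centroids=None, radius=0.15):
    """Greedy threshold clustering of 3D points into identities.

    points: list of (x, y, z) tuples (assumed to be in metres).
    prev_centroids: optional {identity: (x, y, z)} from the previous frame,
        used as temporal guidance; None means no guidance.
    radius: hypothetical maximum distance for joining an existing cluster.
    Returns ({identity: centroid}, [identity label for each point]).
    """
    centroids = dict(prev_centroids or {})
    members = {ident: [] for ident in centroids}
    labels = []
    next_id = max(centroids, default=-1) + 1

    for p in points:
        # Find the nearest existing centroid within the radius.
        best, best_d = None, radius
        for ident, c in centroids.items():
            d = math.dist(p, c)
            if d <= best_d:
                best, best_d = ident, d
        if best is None:
            # No cluster is close enough: start a new identity seeded at p.
            best = next_id
            next_id += 1
            centroids[best] = p
            members[best] = []
        members[best].append(p)
        labels.append(best)

    # Update centroids from assigned points; identities that received no
    # points keep their previous centroid.
    for ident, pts in members.items():
        if pts:
            centroids[ident] = tuple(sum(axis) / len(pts) for axis in zip(*pts))
    return centroids, labels
```

Usage: call it once per frame and feed the returned centroids back in as `prev_centroids` for the next frame, so identities persist through frames where 2D matching across views would fail.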

3 / 13

ADL4D MLOps — Fragmentation & Infrastructure

Learning to isolate training

Training Challenges:
- Dependency, code, and orchestration variations across models, model-tasks, versions and GPU arch.
- Fragmented Cluster (A4000, A5000, A6000 nodes).
- GPU efficiency
Actions:
- Docker Containerization
- WandB for experiment tracking, grouping, artifact uploads, and code version checks
- PyTorch Lightning and hyperparameter sweep libraries for experiment jobs
- Fixing OSS codebases to handle challenges raised by later work and our own
- Diagnosing and removing GPU bottlenecks with nvidia-smi, htop, and WandB
- Standardizing DDP for all codebases

4 / 13

Cirqular — Zero to One

Pointcloud Segmentation Week 0–1

Context:
- Zero initial cloud resources.
- Building a LiDAR segmentation training pipeline from scratch
Action:
- Physically built an A6000 on-prem node.
- Identified and selected Pointcept (Point Transformer series) as the baseline
- Adapted loaders for processed data, ran sample trainings, and confirmed metrics for the top-10 models using WandB
- Built the preprocessing workflow to ingest new raw LiDAR data
- Containerized and upgraded the training environment for the latest dependencies (CUDA, PyTorch, PyG, SpConv)
- Deployed an on-prem ClearML server to reduce experiment-tracking cost.

5 / 13

Cirqular — Feedback Loop

Weeks 2–3

Action:
- Automated sweeps for training recipes and model sizes to balance latency vs. accuracy.
- Model Zoo: Auto-loading tagged checkpoints from registry.
- Feedback Workflow: Users → Inference Team → Annotators+Trainers → Retraining.
- Integrated methods to address immediately visible issues such as class imbalance, slicing dimensions, and early stopping
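
One common way to counter class imbalance in segmentation is inverse-frequency loss weighting. The sketch below is an assumption about what such a method could look like, not the recipe used at Cirqular; the function name and the choice to give absent classes a weight of 0 are illustrative.

```python
from collections import Counter


def inverse_frequency_weights(labels, num_classes):
    """Per-class loss weights proportional to 1 / class frequency.

    labels: iterable of integer class ids (e.g. one per LiDAR point).
    num_classes: total number of classes, including ones absent from labels.
    Weights are scaled so that a perfectly balanced dataset gives 1.0 for
    every class. Classes that never occur get 0.0 (an illustrative choice).
    """
    labels = list(labels)
    counts = Counter(labels)
    total = len(labels)
    weights = []
    for c in range(num_classes):
        n = counts.get(c, 0)
        weights.append(total / (num_classes * n) if n else 0.0)
    return weights
```

The resulting list can be passed as the per-class weight of a weighted cross-entropy loss, so rare classes contribute more per point than dominant ones such as ground or vegetation.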

6 / 13

Cirqular — Consulting & Maintenance

Action:
- Retriggering training for new data distributions.
- Precalculating semantic re-weighting based on instance tracking.
- Fixing the Inference Stack for Docker/PyPA outages.
Outcome: Self-sustaining internal model zoo requiring minimal manual intervention.

7 / 13

RnD — SpatialLM

Scale & Abstraction

Context:
- Transitioned to GCP with steady cloud credits
- Segmentation training highly optimized
  - Multi-node DDP, gradient accumulation, standalone ZeRO optimisers (Stage 1 & 2)
- ClearML hosted globally in GCP tracking commit id + diff on every experiment
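
The gradient accumulation listed in the context above relies on one identity: for a mean-reduced loss and equal-sized micro-batches, averaging the micro-batch gradients reproduces the full-batch gradient. The toy sketch below checks this for a one-parameter model y = w * x with squared-error loss; the model and function names are illustrative and unrelated to the actual training code.

```python
def mse_grad(w, batch):
    """Gradient of mean((w * x - y) ** 2) with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_grad(w, batch, micro_batch_size):
    """Same gradient, computed by accumulating over micro-batches.

    Each micro-batch gradient is divided by the number of accumulation
    steps, which mirrors scaling the loss by 1 / steps before backward().
    Requires the batch to split into equal-sized micro-batches.
    """
    if len(batch) % micro_batch_size != 0:
        raise ValueError("batch must split evenly into micro-batches")
    steps = len(batch) // micro_batch_size
    grad = 0.0
    for i in range(steps):
        micro = batch[i * micro_batch_size:(i + 1) * micro_batch_size]
        grad += mse_grad(w, micro) / steps
    return grad
```

In practice this is what lets a large effective batch fit on smaller GPUs: only one micro-batch of activations is held in memory at a time, and the optimizer steps once per accumulated batch.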
Subject:
- SpatialLM and SceneScript showcased promising results using structured language.
- Trained on internal (Meta) and synthetic datasets
- No released training code, and no results from tuning on real data.
Goals:
- Replicate SpatialLM public results
- Consolidate annotations across public and internal datasets
- Build the optimal VLM for our specific use case.

8 / 13

SpatialLM — Iteration 1

Action:
- Integration into Pointcept
  - SceneScript encoder, Llama and Qwen language models
  - Basic Embeddings patching and vocab resizing utility
  - Data Conversion and Layout Annotation Processing for TBs of Public data
  - VLM build and training code with initial tests for Pointcept encoders
- Identified and fixed critical issues in sorting determinism, quantization, and token limits
Result: After the fixes, retraining gave far better results, approaching the v1 public release
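
The three issue classes named above (sorting determinism, quantization, token limits) can be illustrated together in one sketch of a layout serializer. Everything here is hypothetical: the entity format `(label, x, y, z)`, the coordinate range, the bin count and the token budget are assumptions for illustration, not values from SpatialLM or the internal codebase.

```python
def quantize(value, lo, hi, bins):
    """Map a coordinate to an integer bin in [0, bins - 1], clamping outliers."""
    clamped = min(max(value, lo), hi)
    idx = int((clamped - lo) / (hi - lo) * bins)
    return min(idx, bins - 1)  # value == hi would otherwise land in bin `bins`


def serialize_layout(entities, lo=-16.0, hi=16.0, bins=640, max_tokens=512):
    """Turn (label, x, y, z) entities into a flat, reproducible token list.

    Sorting uses a total order over the quantized coordinates and the label,
    so the output does not depend on the input order. Entities that would
    exceed max_tokens are dropped whole rather than truncated mid-entity.
    """
    rows = [
        (label, quantize(x, lo, hi, bins), quantize(y, lo, hi, bins),
         quantize(z, lo, hi, bins))
        for label, x, y, z in entities
    ]
    rows.sort(key=lambda r: (r[1], r[2], r[3], r[0]))

    tokens = []
    for label, qx, qy, qz in rows:
        entity_tokens = [label, qx, qy, qz]
        if len(tokens) + len(entity_tokens) > max_tokens:
            break
        tokens.extend(entity_tokens)
    return tokens
```

The design point is that the target sequence for a scene should be a pure function of the scene: if two runs can emit the same layout in different orders, the language model is penalized for valid outputs.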

9 / 13

SpatialLM — Iter 2: Scaling, Standardisation & Abstraction

Action:
- Processed and integrated THOR, CV4AEC, Internal Datasets with S3 Streaming + Local Caching + Process Shard Id
- Extended the necessary Pointcept augmentations to handle layout data
- Iteratively trained and rebalanced
- Identified codebase chokepoints in distributed training
- Unified training under an extended HF Trainer
  - FSDP, checkpointing, optimised data workers, and multiple experiment trackers
- Set up a DeepSpeed launcher script that handles environment and source code forwarding
Result: Experiments became purely config-driven, even at large scale
Validation: The SpatialLM authors' second update mirrored our architecture.
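
The "S3 Streaming + Local Caching + Process Shard Id" item above can be sketched as two small helpers. This is a sketch under stated assumptions, not the production loader: the function names are hypothetical, and `fetch` is a stand-in callable for whatever actually downloads a shard from S3.

```python
import os


def shards_for_process(shard_ids, rank, world_size, worker_id=0, num_workers=1):
    """Return the shards owned by one (rank, data worker) pair.

    Shards are sorted first so every process sees the same order, then
    dealt out round-robin. Each shard is read by exactly one process.
    """
    slot = rank * num_workers + worker_id
    total = world_size * num_workers
    return [s for i, s in enumerate(sorted(shard_ids)) if i % total == slot]


def cached_path(shard_id, cache_dir, fetch):
    """Return a local path for a shard, downloading it only on first use.

    fetch: hypothetical callable shard_id -> bytes (e.g. an S3 GET).
    The file is written under a temporary name and renamed atomically, so
    a crash mid-download never leaves a truncated shard in the cache.
    """
    path = os.path.join(cache_dir, f"{shard_id}.bin")
    if not os.path.exists(path):
        tmp = f"{path}.tmp.{os.getpid()}"
        with open(tmp, "wb") as f:
            f.write(fetch(shard_id))
        os.replace(tmp, path)
    return path
```

Partitioning by process and worker id keeps ranks from streaming the same shard twice, and the cache means later epochs read from local disk instead of the network.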

10 / 13

SpatialLM — Future Proofing (Iter 3)

Self-Healing Infrastructure with Ray Train

Action:
- Investigated Ray Train (Anyscale) to unify execution.
Why:
- Solving "Crash Anxiety" during week-long training runs.
- Native integration with Grafana dashboards
- Identical in abstraction to DeepSpeed launching
Feature: Automatic GPU provisioning + Node Crash Recovery (Restart & Checkpoint management).
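
The restart-and-checkpoint behaviour described above can be illustrated without any framework. The sketch below is plain Python and does not use or imitate the Ray Train API; the function name, the JSON checkpoint format and the use of `RuntimeError` to represent a node crash are all assumptions for illustration.

```python
import json
import os


def run_with_recovery(train_step, total_steps, ckpt_path,
                      max_restarts=3, save_every=10):
    """Run train_step(step) for every step, resuming from the last checkpoint.

    train_step: hypothetical callable doing one unit of training work.
    A RuntimeError is treated as a crash: the loop restarts from the last
    saved step, up to max_restarts times. Returns the number of restarts.
    """
    restarts = 0
    while True:
        # Load the next step to run, or start from 0 on a fresh run.
        start = 0
        if os.path.exists(ckpt_path):
            with open(ckpt_path) as f:
                start = json.load(f)["step"]
        try:
            for step in range(start, total_steps):
                train_step(step)
                if (step + 1) % save_every == 0:
                    # Write atomically so a crash cannot corrupt the file.
                    tmp = ckpt_path + ".tmp"
                    with open(tmp, "w") as f:
                        json.dump({"step": step + 1}, f)
                    os.replace(tmp, ckpt_path)
            return restarts
        except RuntimeError:
            restarts += 1
            if restarts > max_restarts:
                raise
```

Work done since the last checkpoint is repeated after a crash, so `save_every` trades checkpoint overhead against lost progress. A managed framework adds the part this sketch omits: re-provisioning the failed node before resuming.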

11 / 13

Summary of Competencies

Approximate Tech Stack

Compute: GCP, AWS, On-prem (A6000/A5000), Multi-node Clusters.
Orchestration: Docker, DeepSpeed, Ray Train (experimental).
Training Frameworks: PyTorch Lightning, HF Trainer, FSDP, DDP.
Experiment Tracking: WandB, ClearML (Self-hosted & Cloud).
Data Ops: S3 Streaming, Local Caching, Voxelization Pipelines.

12 / 13

Conclusion

Thank you for your time

13 / 13
