InSpaceType: A Dataset and Analysis Tool for Space Type in Indoor Monocular Depth Estimation
Landing page for the paper (in submission)

* Code and evaluation tools release: Here
* arXiv paper: Here
* Workshop version: Here
* Supplementary material: Here

* Data

[Sample data]: This contains 167 MB of sample data.
[InSpaceType Eval set]: This contains 1,260 RGBD pairs (about 11.5 GB) for evaluation. For evaluation, please go to our codebase.
[InSpaceType all data]: This contains the whole InSpaceType dataset: 40K RGBD pairs, about 500 GB. The data is split into 8 chunks; please download all chunks in the folder and extract them (a minimal loading sketch follows).
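To illustrate how an RGBD pair can be read, below is a minimal Python sketch. The folder layout, file names, and the 16-bit-PNG-in-millimeters depth encoding are assumptions for illustration only; please refer to the released data and codebase for the actual format.

```python
# Minimal loading sketch. The paths and the assumption that depth is stored
# as a 16-bit PNG in millimeters are illustrative, not the official format.
import numpy as np
from PIL import Image

def load_rgbd_pair(rgb_path, depth_path):
    """Load an RGB image and its aligned depth map (converted to meters)."""
    rgb = np.asarray(Image.open(rgb_path).convert("RGB"))   # (H, W, 3) uint8
    depth_mm = np.asarray(Image.open(depth_path))           # (H, W) integer mm (assumed)
    return rgb, depth_mm.astype(np.float32) / 1000.0        # mm -> m

# Hypothetical paths for illustration:
# rgb, depth = load_rgbd_pair("sample/rgb/0001.png", "sample/depth/0001.png")
```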

* TL;DR

This work introduces a dataset and benchmark that reconsider an important but usually overlooked factor: space type. We analyze 12 SOTA models and four popular training datasets in detail to unveil their potential biases. Further, we study cross-type generalization and domain generalization techniques.

* Abstract

Indoor monocular depth estimation has attracted growing research interest for indoor robots to aid navigation and perception. Most previous methods primarily experiment with the NYUv2 dataset and concentrate on overall evaluation performance. However, little is known about robustness and generalization in the real world, where highly varying and diverse functional space types, such as libraries or kitchens, appear as tail types. This work studies a common but easily overlooked factor, space type, and reveals a model's performance variance across types. We present the InSpaceType dataset, a high-quality and high-resolution RGBD dataset for general indoor environments. We study 12 recent methods on InSpaceType and find that most of them severely suffer from performance imbalance between head and tail types, revealing their underlying bias. We extend the analysis to a total of four datasets and organize their characteristics to inform further research directions on their proper usage. Furthermore, we study the interplay between space types and generalization to unseen spaces. Our work marks the first in-depth investigation of performance variance across space types and, more importantly, releases useful tools, including datasets and code, to closely examine a given pretrained model.

* Analysis I-II [Benchmark on overall performance and space type breakdown]


InSpaceType benchmark: overall performance. The best number is in bold, and the second-best (by method) is underlined. $^*$ denotes self-supervised learning.


The first table studies three top methods among those trained only on NYUv2 for depth estimation (N-Depth only): MIM, PixelFormer, and NeWCRFs. The second table studies top methods pretrained on multiple datasets or learned from large-scale pretraining (M&LS-Pre) and then finetuned on NYUv2: ZoeDepth and VPD. Besides the breakdown, we also list the top-5 space types by lower/higher error (RMSE) and accuracy ($\delta_1$). Easy and hard types are identified based on co-occurrence in the top-5 lists.
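For reference, the two reported metrics can be computed as in the sketch below, which follows the standard RMSE and $\delta_1$ definitions and adds a simple per-type aggregation. This is a generic illustration, not the released evaluation code; the validity mask and helper names are assumptions.

```python
# Standard depth metrics: RMSE and threshold accuracy delta_1 (ratio < 1.25),
# plus a per-space-type aggregation. Generic sketch, not the official evaluator.
from collections import defaultdict
import numpy as np

def depth_metrics(pred, gt, min_depth=1e-3):
    """RMSE and delta_1 over valid ground-truth pixels."""
    valid = gt > min_depth                              # skip missing/zero depth
    p, g = pred[valid], gt[valid]
    rmse = np.sqrt(np.mean((p - g) ** 2))
    delta1 = np.mean(np.maximum(p / g, g / p) < 1.25)   # fraction within 1.25x
    return rmse, delta1

def per_type_breakdown(samples):
    """samples yields (pred, gt, space_type); returns mean metrics per type."""
    buckets = defaultdict(list)
    for pred, gt, space_type in samples:
        buckets[space_type].append(depth_metrics(pred, gt))
    return {t: tuple(np.mean(np.array(m), axis=0)) for t, m in buckets.items()}
```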

* Analysis III [More training datasets]

Space type breakdown and characteristics for the SimSIN, UniSIN, and Hypersim datasets.

* Analysis IV [Cross-group generalization]

Cross-group generalization evaluation. G1$\to$ denotes the training group (G), and each row below shows an evaluation group. Three depth ranges, close, medium, and far, are further used to break down performance by distance.
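A sketch of how such a range-wise breakdown can be computed is shown below. The close/medium/far cut-off values here are illustrative placeholders, not the boundaries defined in the paper.

```python
# Range-wise RMSE breakdown. The range boundaries below are illustrative
# placeholders; the paper defines its own close/medium/far cut-offs.
import numpy as np

RANGES = {"close": (0.0, 2.0), "medium": (2.0, 5.0), "far": (5.0, np.inf)}

def rmse_by_range(pred, gt, ranges=RANGES, min_depth=1e-3):
    """RMSE restricted to pixels whose ground-truth depth falls in each range."""
    out = {}
    for name, (lo, hi) in ranges.items():
        mask = (gt > max(lo, min_depth)) & (gt <= hi)
        if mask.any():
            p, g = pred[mask], gt[mask]
            out[name] = float(np.sqrt(np.mean((p - g) ** 2)))
    return out
```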

* Conclusion


We are the first to consider space types in indoor monocular depth estimation. We point out limitations in previous works, where performance variance across types is overlooked, and then present a novel dataset, InSpaceType, along with a hierarchical space type definition to facilitate our study. Twelve recent high-performing methods are examined by cross-dataset evaluation, including overall performance and a space type breakdown. All of them suffer, to varying degrees, from performance imbalance between space types, and we find that strategies like mixed-dataset pretraining or learning from large-scale pretraining show better robustness. Our analysis investigates a total of four training datasets and organizes their characteristics to better guide future research directions using these datasets. In particular, we find that current synthetic data curation cannot faithfully reflect the high complexity of real-world cluttered and small objects in the near field. Analyzing the interplay between space types in depth, we show generalization between groups; specifically, generalization to different depth ranges is harder than to changes in scene appearance and arrangement. We further inspect three approaches to enhance generalization: class reweighting, a type-balanced sampler (sketched below), and meta-learning.
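As a concrete illustration of the type-balanced sampler idea, the PyTorch sketch below draws each training sample with probability inversely proportional to its space type's frequency, so tail types are seen as often as head types during training. The variable names are illustrative, and this is not the exact implementation used in the paper.

```python
# Type-balanced sampling via PyTorch's WeightedRandomSampler: inverse-frequency
# weights per space type. A minimal sketch of the idea, not the paper's code.
from collections import Counter
from torch.utils.data import WeightedRandomSampler, DataLoader

def make_type_balanced_sampler(space_types):
    """space_types: list giving the space-type label of each training sample."""
    counts = Counter(space_types)
    weights = [1.0 / counts[t] for t in space_types]   # rarer type -> higher weight
    return WeightedRandomSampler(weights, num_samples=len(space_types),
                                 replacement=True)

# Hypothetical usage:
# sampler = make_type_balanced_sampler(train_types)
# loader = DataLoader(train_set, batch_size=8, sampler=sampler)
```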
Overall, this work pursues a practical purpose and emphasizes the importance of a usually overlooked factor, space type, in indoor environments. We release the analysis tool, including code and datasets, to diagnose a pretrained model and report metrics for a hierarchical space type breakdown. We believe such a tool is of broad interest to the robotic vision community. Note that this work focuses on monocular depth estimation, as it is fundamental and useful in wide applications such as indoor robots or AR/VR. Multiview reconstruction may also deserve attention to space type in future studies.

Sample hierarchy labeling and breakdown:
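For illustration only, a two-level hierarchy entry might look like the following; the type names and grouping here are hypothetical placeholders, and the actual hierarchical space type definition is given in the paper.

```python
# Hypothetical two-level space type hierarchy for illustration; see the
# paper for the actual definition.
hierarchy = {
    "residential": ["kitchen", "bedroom", "living room"],
    "public": ["library", "classroom", "hallway"],
}
```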

The website template was borrowed from Michaël Gharbi