Physics-informed machine learning for cloud detection

Published in Remote Sensing of Environment, 2025

Abstract: Accurate cloud detection is a critical preprocessing step for utilizing Landsat and Sentinel-2 data in a wide range of remote sensing applications. While existing methods, including physical-rule-based and machine-learning-based approaches, have shown promise, they often struggle to reliably distinguish clouds from spectrally similar surfaces (e.g., snow/ice, coastal, high mountain, and urban areas) or require extensive, globally representative training data. This study introduces Fmask 5 (Version 5 of Function of mask), offering seven different cloud detection models for Landsats 4-9 (Landsat 4, 5, 7, 8, and 9) and Sentinel-2 (Sentinel-2 A, B, and C) imagery, by integrating physical rules with machine learning or using each approach independently. Fmask 5 synergistically integrates physical rules, adapted from previous Fmask versions, with machine learning (ML) models in a unique feedback loop, resulting in a novel physics-informed machine learning (PIML) framework. The key innovation is the use of dynamically generated, localized training data derived from image-specific physical rule applications. This PIML addresses the limitations of globally trained ML models by adapting to the specific atmospheric and surface conditions of each scene. Specifically, a base pixel-based shallow ML (Light Gradient-Boosting Machine - LightGBM) or CNN-based deep ML (UNet) model is first applied to generate a preliminary cloud mask. This mask is then used to calibrate the application of physical rules, determining the optimal combination of spectral tests and thresholds for that particular image. The resulting physical-rule-calibrated cloud mask serves as refined training data to fine-tune and optimize the ML models. We strategically combine the pixel-based strengths of LightGBM with the spatial pattern recognition capabilities of the CNN-based UNet in a series of experiments. 
Independent global validation demonstrates that Fmask5-PIML substantially outperforms methods relying solely on physical rules or machine learning, achieving overall accuracies of 93% for Landsats 8-9, 94% for Sentinel-2, and 92% for Landsats 4-7 (Landsat 4, 5, and 7). Importantly, Fmask5-PIML models achieve high accuracy with CPU-based computational efficiency, making them suitable for operational, large-scale global processing. The proposed framework represents a significant advancement in cloud detection methodology, demonstrating the broader potential of PIML in remote sensing and enabling higher-level products (analysis-ready data, composites, etc.) for numerous applications.
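
The feedback loop described in the abstract (base model, rule calibration, scene-specific refit) can be sketched roughly as follows. This is a toy illustration under heavy assumptions, not the Fmask 5 implementation: a fixed linear scorer stands in for the base LightGBM/UNet model, a single brightness threshold stands in for the full set of spectral tests, and a nearest-centroid classifier stands in for fine-tuning. All function names, features, and parameters are hypothetical.

```python
import numpy as np


def base_model_predict(features, weights, bias):
    """Hypothetical stand-in for a globally trained base model."""
    return features @ weights + bias > 0.0


def calibrate_threshold(brightness, preliminary_mask, candidates):
    """Pick the threshold whose rule-based mask best agrees with the preliminary mask."""
    agreement = [np.mean((brightness > t) == preliminary_mask) for t in candidates]
    return float(candidates[int(np.argmax(agreement))])


def fit_centroids(features, labels):
    """Toy 'fine-tuning': per-class feature centroids fitted on scene-specific labels."""
    return features[labels].mean(axis=0), features[~labels].mean(axis=0)


def predict_centroids(features, cloud_centroid, clear_centroid):
    """Label each pixel by its nearer centroid (True means cloud)."""
    d_cloud = np.linalg.norm(features - cloud_centroid, axis=1)
    d_clear = np.linalg.norm(features - clear_centroid, axis=1)
    return d_cloud < d_clear


def piml_scene(features, weights, bias, candidates):
    """One pass of the loop for a single scene; column 0 is assumed to be brightness."""
    # Step 1: preliminary mask from the (assumed) base model.
    preliminary = base_model_predict(features, weights, bias)
    # Step 2: calibrate the physical rule against the preliminary mask.
    brightness = features[:, 0]
    threshold = calibrate_threshold(brightness, preliminary, candidates)
    rule_mask = brightness > threshold
    # Degenerate scene (all cloud or all clear): nothing to refit on.
    if rule_mask.all() or not rule_mask.any():
        return preliminary, threshold
    # Step 3: refit on the rule-calibrated labels and predict the final mask.
    cloud_c, clear_c = fit_centroids(features, rule_mask)
    return predict_centroids(features, cloud_c, clear_c), threshold


# Synthetic scene: two features per pixel, clouds brighter than clear surfaces.
rng = np.random.default_rng(0)
truth = rng.random(2000) < 0.4
brightness = np.where(truth, rng.normal(0.6, 0.05, 2000), rng.normal(0.2, 0.05, 2000))
second_band = np.where(truth, rng.normal(0.3, 0.05, 2000), rng.normal(0.5, 0.05, 2000))
scene = np.column_stack([brightness, second_band])

final_mask, chosen_threshold = piml_scene(
    scene,
    weights=np.array([1.0, 0.0]),
    bias=-0.3,
    candidates=np.linspace(0.1, 0.9, 17),
)
print("calibrated threshold:", chosen_threshold)
print("agreement with synthetic truth:", np.mean(final_mask == truth))
```

The point of the sketch is the data flow: the training labels for the final model come from the scene itself, via rules tuned to that scene, rather than from a global training set.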

Recommended citation: Shi Qiu, Zhe Zhu, Xiucheng Yang, Junchang Ju, Qiang Zhou, and Christopher S.R. Neigh (2025). "Physics-informed machine learning for cloud detection." Remote Sensing of Environment. In revision.