Abstract
Air pollution is among the most critical environmental and public health challenges worldwide, yet traditional sensor-based monitoring systems are costly and difficult to scale. Image-based air quality estimation has emerged as a promising alternative that exploits the visual signatures of atmospheric pollutants in traffic scenes. However, existing methods generalize poorly across cities and make little use of multiple camera views. We present AQIFormer, a transformer-based ensemble architecture that addresses these limitations through dual-view integration, weather-aware attention mechanisms, and multi-task learning. Our approach combines front and rear traffic imagery with meteorological parameters to classify air quality robustly across diverse urban environments. On a dataset of 26,678 synchronized front–rear image pairs, AQIFormer achieves 89.96% accuracy, a 14.96% improvement over state-of-the-art methods. The model also generalizes across cities: with few-shot adaptation on a small number of training samples, it achieves 81.67% accuracy on an independent dataset collected in Nagpur, India, a drop of only 8.29 percentage points.