Deep Learning Techniques for Visual Counting
The explosion of Deep Learning (DL) has boosted the already rapidly developing field of Computer Vision to the point that vision-based tasks are now part of our everyday lives. Applications such as image classification, photo stylization, and face recognition are nowadays pervasive, as evidenced by modern systems seamlessly integrated into mobile applications. In this thesis, we investigated and enhanced the visual counting task, which aims at automatically estimating the number of objects in still images or video frames. Due to the growing interest in this task, the scientific community has recently proposed several solutions based on Convolutional Neural Networks (CNNs). These artificial neural networks, inspired by the organization of the animal visual cortex, automatically learn effective representations from raw visual data and can successfully address the typical challenges characterizing this task, such as low-quality images, varying illumination, and object scale variations. Beyond these difficulties, in this dissertation we identified some further crucial limitations in the adoption of CNNs, and we proposed general solutions that we experimentally evaluated in the context of the counting task, which turns out to be particularly affected by these shortcomings. In particular, we tackled the lack of data needed for training current DL-based solutions. Since the budget for labeling is limited, data scarcity remains an open problem that prevents the scalability of existing solutions based on the supervised learning of neural networks and that causes a significant drop in performance at inference time when these algorithms face new scenarios.
This concern is particularly evident in tasks such as counting, where the objects to be labeled may number in the hundreds, or even thousands, per image, significantly increasing the human effort needed for annotation. We proposed solutions addressing this issue from several complementary angles. We introduced synthetic datasets gathered from virtual environments resembling the real world, where the training labels are collected automatically, drastically reducing the human annotation effort. We proposed Domain Adaptation (DA) strategies, both supervised and unsupervised, to mitigate the domain gap between the training and test data distributions. We presented a counting strategy for a weakly labeled data scenario, i.e., one with non-negligible disagreement between multiple annotators, which enhances counting performance by exploiting the redundant information arising from the raters' differing judgments. Moreover, we tackled the non-trivial engineering challenges arising from the adoption of CNN-based techniques in environments with limited power resources, mainly due to the high computational budget that AI-based algorithms require. We introduced solutions for counting vehicles directly onboard embedded vision systems, i.e., devices with constrained computational capabilities that can both capture and process images. Finally, we designed a modular, embedded, Computer Vision-based and AI-assisted system that carries out several tasks to help monitor compliance with individual and collective human safety rules, such as estimating the number of people present in a region of interest.
