Dialogue Enhancement and Personalization – Contributions to Quality Assessment and Control
The production and delivery of audio for television involve many creative and technical challenges. One of them concerns the level balance between the foreground speech (also referred to as dialogue) and the background elements, e.g., music, sound effects, and ambient sounds. Background elements are fundamental for the narrative and for creating an engaging atmosphere, but they can mask the dialogue, which the audience wishes to follow in a comfortable way. Widely differing individual factors among audience members clash with the creative freedom of the content creators. As a result, service providers receive regular complaints about difficulties in understanding the dialogue because the background sounds are too loud. While this has been a known issue for at least three decades, works analyzing the problem and up-to-date statistics were scarce before the contributions in this work. Enabling the user to personalize the dialogue level provides technological relief from the problem of a background perceived as too loud. The content creators are free to craft the audio soundtrack according to their artistic vision, yet personalization is available if users want it. This functionality is often referred to as Dialogue Enhancement (DE) and can be implemented with object-based audio, requiring separate audio objects (or stems) from the production stage. Stems are often not available, e.g., for archive material, which consists solely of the final stereo soundtracks. Blind Source Separation (BSS) can be applied to the final soundtrack to estimate the stems. However, before the contributions in this dissertation, only a few works dealt with BSS techniques directly applicable to television content and with evaluation methods for this application. BSS might introduce artifacts, colorations, and distortions. These are highly undesirable, as the final audio quality is of the utmost importance in television. 
Before the contributions in this dissertation, it was not clear what subjective and objective methods could be used to evaluate audio quality for this application. Previous works focusing on objective methods did not answer the question of whether objective measures perform well enough for detecting perceivable quality degradations introduced by BSS for DE and for counteracting them. The main contributions of this work advance our knowledge of the issues identified above and can be organized into the following three categories. Firstly, via a nationwide survey, insight is obtained into the gravity and the causes of the frustration that the television audience experiences in relation to the audio balance between dialogue and background. Furthermore, controlled experiments are carried out to investigate personal preferences about the level difference between dialogue and background. Highly individual preferences are observed, underscoring the importance of providing the final user with DE. Moreover, it is observed that audio experts prefer a louder background than non-experts (by 4 dB on average), explaining part of the frustration experienced by the non-experts in the audience. Based on these experiments, technical guidelines are formulated for the production of esthetically pleasing audio for television with clear speech. Secondly, BSS solutions specifically designed for DE are presented, based both on traditional signal processing and on Deep Neural Networks (DNNs), where special attention is given to the quality assessment of these solutions. Subjective and objective evaluation methods are proposed to evaluate BSS for DE. The subjective methods span a large range of factors influencing the final Quality of Experience (QoE). Listening Effort (LE) is evaluated in a multimodal way, including pupillometry and considering audio signals representative of the application. 
Other quality factors, such as the perception of artifacts and distortions or the overall audio quality, are studied with approaches inspired by standard methodologies, but with modifications relevant to the application. Approaching the overall QoE, the Adjustment/Satisfaction Test (A/ST) is proposed as a method to evaluate content personalization and the resulting user satisfaction. In addition, a large number of users are surveyed after having interacted with a prototype of the final system. Thanks to these methods, different BSS approaches at different development stages are evaluated, and it is shown that the proposed BSS solutions can successfully enable DE. Objective measures are investigated in terms of their response to distortions typically encountered in BSS for DE as well as their correlation with subjective quality scores. It is shown that objective measures very often do not generalize to cases unseen during their development. The measures that exhibit the best performance are then considered for automatic quality control. Thirdly, automatic quality control is discussed in detail. Solutions are proposed to control the remixing of the estimated stems, with the aim of maximizing the background attenuation under a constraint on the minimum quality of the remixed output. DNNs are trained either to estimate the remixing gain directly or to follow a two-step approach, in which a non-intrusive quality estimate is first obtained and then mapped to the remixing gains. In summary, contributions to the most relevant areas concerning BSS for DE in television audio are given, providing a better understanding of the importance of DE and laying the groundwork for methodological development, evaluation, and control of BSS for DE. Evidence of the success of these contributions can be identified in one outcome from the nationwide survey, where the BSS-based DE, developed within the proposed evaluation framework, was clearly preferred over the original soundtrack. 
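The quality-constrained remixing mentioned above can be illustrated with a minimal sketch. It is not the method of the dissertation: it replaces the trained DNNs with an exhaustive search over candidate attenuations, and the names `remix`, `max_attenuation`, and `quality_fn` are hypothetical. `quality_fn` stands in for any non-intrusive quality estimator that returns a higher score for better quality.

```python
def remix(dialogue, background, attenuation_db):
    """Mix estimated stems, attenuating the background by attenuation_db (dB)."""
    gain = 10.0 ** (-attenuation_db / 20.0)  # dB attenuation -> linear amplitude gain
    return [d + gain * b for d, b in zip(dialogue, background)]


def max_attenuation(dialogue, background, quality_fn, min_quality, candidates_db):
    """Return the largest candidate attenuation whose remix still meets min_quality.

    quality_fn is a hypothetical stand-in for a non-intrusive quality estimator.
    Returns 0.0 (no attenuation) if no candidate satisfies the constraint.
    """
    best = 0.0
    for attenuation_db in candidates_db:
        output = remix(dialogue, background, attenuation_db)
        if quality_fn(output) >= min_quality and attenuation_db > best:
            best = attenuation_db
    return best
```

In this sketch every candidate is checked independently, so the quality estimate does not need to decrease monotonically with attenuation; a direct gain estimator, as in the one-step approach described above, would avoid the search altogether.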
This is a cumulative dissertation (or thesis by publication). Its body consists of a previously unpublished exposition illustrating the contextual links between previously published works, all peer-reviewed. The interested reader is referred to the appendix, where the individual publications can be found.
