FaithDiff: Unleashing Diffusion Priors for Faithful Image Super-resolution

IMAG, Nanjing University of Science and Technology

Abstract

Faithful image super-resolution (SR) not only needs to recover images that appear realistic, as in image generation tasks, but also requires that the restored images maintain fidelity and structural consistency with the input. To this end, we propose a simple and effective method, named FaithDiff, to fully harness the power of latent diffusion models (LDMs) for faithful image SR. In contrast to existing diffusion-based SR methods that freeze the diffusion model pre-trained on high-quality images, we propose to unleash the diffusion prior to identify useful information and recover faithful structures. As there exists a significant gap between the features of degraded inputs and the noisy latents of the diffusion model, we develop an effective alignment module that extracts useful features from degraded inputs and aligns them with the diffusion process. Considering the indispensable roles and interplay of the encoder and the diffusion model in LDMs, we jointly fine-tune them in a unified optimization framework, enabling the encoder to extract useful features that coincide with the diffusion process. Extensive experimental results demonstrate that FaithDiff outperforms state-of-the-art methods, producing high-quality and faithful SR results.

Method

Our goal is to present an effective method that fully harnesses the power of LDMs for image SR. Specifically, we first employ the encoder of a pre-trained variational autoencoder (VAE) to map low-quality (LQ) inputs into the latent space and extract the corresponding LQ features. Then, we develop an alignment module to effectively transfer useful information from the latent LQ features and ensure that it aligns well with the diffusion process. In addition, we incorporate text embeddings, extracted from image descriptions by a pre-trained text encoder, as auxiliary information. These embeddings are integrated with the latent features of the diffusion model through cross-attention layers to help explore useful structural information. Furthermore, we propose a unified feature optimization strategy that jointly fine-tunes the VAE encoder and the diffusion model, allowing the encoder to extract information from LQ images that facilitates the diffusion process while enabling the diffusion model to further refine the extracted features for high-quality (HQ) image SR. Finally, we obtain the restored image from the refined features using the pre-trained VAE decoder.
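To make this pipeline concrete, the following is a minimal PyTorch sketch of one training step under the unified optimization described above. All module architectures here (the convolutional stubs standing in for the VAE encoder/decoder and the diffusion UNet, and the AlignmentModule design) and all hyperparameters are our own illustrative assumptions rather than the authors' implementation; text conditioning via cross-attention is omitted for brevity.

```python
import torch
import torch.nn as nn

class AlignmentModule(nn.Module):
    """Hypothetical stand-in for the paper's alignment module: maps latent LQ
    features toward the space of the noisy diffusion latents."""
    def __init__(self, dim: int = 4):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Conv2d(dim, dim, 3, padding=1),
            nn.SiLU(),
            nn.Conv2d(dim, dim, 3, padding=1),
        )

    def forward(self, lq_latent: torch.Tensor) -> torch.Tensor:
        return self.proj(lq_latent)

# Toy stand-ins for the pre-trained components; in practice these would be
# loaded from a pre-trained LDM.
vae_encoder = nn.Conv2d(3, 4, 8, stride=8)            # trainable LQ encoder (stub)
frozen_encoder = nn.Conv2d(3, 4, 8, stride=8)         # frozen copy for HQ target latents (stub)
frozen_encoder.requires_grad_(False)
vae_decoder = nn.ConvTranspose2d(4, 3, 8, stride=8)   # latent -> image (stub)
denoiser = nn.Conv2d(8, 4, 3, padding=1)              # UNet stub: [noisy | aligned LQ] -> noise estimate
align = AlignmentModule(dim=4)

# Unified optimization: the VAE encoder, alignment module, and diffusion model
# are fine-tuned jointly, in contrast to freezing the diffusion prior.
params = list(vae_encoder.parameters()) + list(align.parameters()) + list(denoiser.parameters())
opt = torch.optim.AdamW(params, lr=1e-5)

lq = torch.randn(1, 3, 256, 256)   # dummy low-quality input
hq = torch.randn(1, 3, 256, 256)   # dummy high-quality target

# One training step with the standard epsilon-prediction diffusion loss.
with torch.no_grad():
    hq_latent = frozen_encoder(hq)                 # clean target latent
noise = torch.randn_like(hq_latent)
alpha_bar = torch.tensor(0.7)                      # toy fixed noise level; real training samples a timestep
noisy = alpha_bar.sqrt() * hq_latent + (1 - alpha_bar).sqrt() * noise
lq_feat = align(vae_encoder(lq))                   # aligned LQ features conditioning the denoiser
pred = denoiser(torch.cat([noisy, lq_feat], dim=1))
loss = nn.functional.mse_loss(pred, noise)
opt.zero_grad()
loss.backward()
opt.step()

# At inference, the refined latent produced by iterative denoising would be
# decoded by the pre-trained VAE decoder: restored = vae_decoder(refined_latent)
```

Because gradients flow through both the denoiser and the LQ branch of the encoder in this step, the encoder learns to extract features that serve the diffusion process, which is the key difference from pipelines that keep the diffusion prior frozen.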