Our paper "Explanatory Instructions: Towards Unified Vision Tasks Understanding and Zero-shot Generalization" is available at https://arxiv.org/abs/2412.18525.
The dataset and code will be released in two months.
Explanatory Instruction:
"Fill in all the empty outlines with rich colors that reflect vibrant tones, while redefining the shapes with smooth textures. Add layers of depth to the flat contours by enhancing brightness gradients in the sky, shadowing in the mountains, and intricate shades among the flowers. Reintroduce the sensation of open space and dimension by contrasting sharp objects with muted backgrounds and crisp details in the foreground."
Resolution:
448×448.
Explanatory Instruction:
"Slowly remove the rain falling from the sky in the image, still maintain the state of night, and the girl on the bridge is also still holding the umbrella, but readjust the light in the distance."
Limitations:
The model struggles to preserve smaller objects and environmental details.
Resolution:
448×448.
Explanatory Instruction:
"Increase the overall brightness to reveal details in dark areas while preserving highlights. Adjust the contrast to enhance the brightness differences between regions, making the structures and textures more distinct. Optimize color saturation to make previously dull colors more vibrant, such as the blue on the floor becoming more prominent. Apply denoising to reduce noise commonly found in low-light images, improving the overall quality. Ensure the final image appears natural while retaining the authentic style of the scene."
Limitations:
Controlling the intensity of lighting enhancement through language instructions is challenging, often resulting in significant deviations in the output.
Resolution:
448×448.
Explanatory Instruction:
"Remove the falling snow from the sky in the image, keep the other objects and snow in the image, still keep it dark, but pay attention to the adjustment of light behind the tree."
Limitations:
The second generated image struggles to retain nighttime details, while the third and fourth images perform poorly at removing snow from the sky. Additionally, attempting to remove snow from the ground at the same time can cause significant distortions.
Resolution:
448×448.
Explanatory Instruction:
"The image shows noticeable multiple visual overlaps of trees and buildings. I would like to remove visual overlaps and restore a clear, sharp image without blurring. Do not alter the main content and pay attention to adjusting the light."
Limitations:
Language instructions succeed in eliciting the model's task-level zero-shot capability only at a relatively low rate.
Resolution:
448×448.
Explanatory Instruction:
"Retain the distant clouds in the image while removing as much fog as possible. Attempt to restore the faintly visible sun in the distance, but ensure there is no strong sunlight. Focus on recovering the mountains and the nearby trees as much as possible."
Limitations:
The model can distort certain objects.
Resolution:
448×448.