Add image resizing functionality for perseus file exports #4786
base: unstable
Conversation
I haven't yet wrapped my head around the whole resize procedure itself, but I do see a situation where we should be cautious regarding reading files from storage.

You've shown good and valid instincts by attempting to consolidate the code that reads images from storage, but I think performance considerations will need to take precedence, since `storage` represents Google Cloud Storage (GCS). There'll be significant latency involved in any operation against GCS (compared with not accessing it), including reading an image file.

Since this code lives within the channel publishing procedure, an area of growing need for optimization, we should be cautious about anything that may add to its duration. A channel could have many exercises, and within each exercise there could be many questions and answers, either of which could hold images. So during the publishing process, we should avoid accessing them in storage until absolutely necessary.

At a higher level, it seems there exists at least one pathway where `original_content` could go unused, meaning we could be greatly increasing the duration of the publishing procedure unnecessarily. I'd be interested in how far we can go with avoiding storage access within the resizing procedure itself, but as I said, I haven't wrapped my head around it all. I also realize there are more challenges with regards to the resizing procedure than the high-level path I noted.
```diff
@@ -708,18 +728,40 @@ def process_image_strings(content, zf, channel_id):
         logging.warning("NOTE: the following error would have been swallowed silently in production")
         raise

     image_name = "images/{}.{}".format(checksum, ext[1:])
     if image_name not in zf.namelist():
         with storage.open(ccmodels.generate_object_storage_name(checksum, filename), 'rb') as imgfile:
```
Prior to the changes here, it looks like reading from storage only occurred within the above `if` condition, which seems to test whether the file exists in the perseus zip.
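To make that guard pattern concrete, here is a minimal, self-contained sketch of it, with plain dicts standing in for both GCS and the zipfile (all names here are illustrative, not Studio's actual API):

```python
def write_image_once(zip_entries, storage, checksum, ext):
    """Write an image into zip_entries only if it isn't there already.

    Returns True when storage was actually read, False for a duplicate.
    """
    image_name = "images/{}.{}".format(checksum, ext)
    # Mirrors `if image_name not in zf.namelist()`: for a repeated image the
    # checksum repeats, the name already exists, and the expensive read (GCS
    # in the real code) is skipped entirely.
    if image_name not in zip_entries:
        data = storage[checksum]
        zip_entries[image_name] = data
        return True
    return False
```

The important property is that the storage access sits *inside* the membership check, so duplicate images never trigger a second read.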
```diff
             image_list.append(image_data)
             content = content.replace(match.group(1), img_match.group(1))
         original_image_name = "images/{}.{}".format(checksum, ext[1:])
         with storage.open(ccmodels.generate_object_storage_name(checksum, filename), 'rb') as imgfile:
```
After your changes here, it seems we'll always read the image from storage, for every match in the for-loop that contains this.
```diff
                 write_to_zipfile(original_image_name, original_content, zf)
                 content = content.replace(match.group(1), img_match.group(1))
             else:
                 if original_image_name not in zf.namelist():
```
It seems in this high-level path, if this results in `False`, then `original_content` goes unused?
Hi @bjester!
@KshitijThareja upon further review, it seems I didn't have my head around this code entirely. I think the only area where we might save time is when the same image is used multiple times in the exercise. I think the existing check does that, because the checksum would be the same and therefore the file would already exist in the zip.

I think you can mimic that for the resizing, producing a zero net effect on the duration, by keeping a mapping of the original name to the resized name. That way, we only ever read the same image once. You might also have to keep track of the width/height along with that, in case the same image is used in multiple sizes. Although, perhaps if that's the case, we could keep only one size of the image?
@bjester So what you're suggesting here is that if an image is repeated, we keep the original one (with/without resizing). If that's the case, and if this behavior is desirable, we can surely do that. That could help save time, though only if we have repeated image uploads, as you mentioned.
@KshitijThareja no, not exactly. Prior to the image resizing, if an image was repeated, that means its checksum would repeat, so the existing check against `zf.namelist()` prevented reading it from storage again. What I'm suggesting is that we preserve this handling, because in such a situation the performance cost should be no different than it was before. It seems this publishing code always processes images if the exercise has changed, which was my mistake in my initial review. So if we prevent reprocessing of duplicate images (specifically in regards to how many reads from GCS), then the overall performance should be similar, ignoring for the moment the cost of resizing, because that's the desired feature.

It gets a little complicated, because the same image could appear with multiple rendered sizes, although there could be benefit to handling that sort of scenario too. For example, say a particular image with a native size of 1024x1024 is used 2 times, and the rendered sizes of the image are 512x512 and 568x568. If we go through each instance, my understanding of your work thus far is that it would produce 2 images in the zip file, one for each size. Resizing the images saves disk space, but because the image is used multiple times, each duplicate starts to reduce the effectiveness of the process in regards to disk space. The resizing process itself satisfies the original issue, where images are used in answers. But in the case of the same image used multiple times, it reduces the effectiveness of the resizing process in saving space in the zip.
So what I propose is that we handle 2 things at once by focusing on handling duplicates. First, by handling duplicates, we can prevent repeat reads from GCS, meaning the prior behavior is preserved. Second, we could institute logic that determines when we could utilize just a single copy of the resized image. For instance, if two rendered sizes are within 1% of each other, it seems we could save space by using one resized image (perhaps the smaller) and having the other usage reference the same image, instead of keeping 2 resized versions of the same image.

*The 1% leeway in sizes was just for example; I'm not sure what the sweet spot is for that.
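A rough sketch of what such a mapping could look like, with entirely hypothetical names and the same placeholder 1% tolerance from the comment above:

```python
RESIZE_TOLERANCE = 0.01  # 1% leeway; the actual sweet spot is undecided


class ResizedImageCache:
    """Tracks resized copies already written to the zip, keyed by checksum,
    so each source image is read from storage at most once and near-identical
    rendered sizes can share a single resized copy."""

    def __init__(self):
        # checksum -> list of ((width, height), zip_name)
        self._entries = {}

    def lookup(self, checksum, width, height):
        """Return the zip name of a previously resized copy whose dimensions
        are within RESIZE_TOLERANCE of the requested size, or None."""
        for (w, h), name in self._entries.get(checksum, []):
            if (abs(w - width) <= RESIZE_TOLERANCE * width
                    and abs(h - height) <= RESIZE_TOLERANCE * height):
                return name
        return None

    def record(self, checksum, width, height, zip_name):
        """Remember that this (checksum, size) was resized and written."""
        self._entries.setdefault(checksum, []).append(((width, height), zip_name))
```

In the publishing loop, a `lookup` hit would mean reusing the existing zip entry with no storage read at all; a miss would trigger the read/resize/write, followed by `record`.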
Hi @bjester!

I've updated the code to use a similar resized image (within 1% of the current image to be resized) from the map, when one exists. If we decided to go with the current one and replace the existing one (from the map) with it, that would make the whole process complicated, as the existing image would have already been written to the zipfile and we'd again have to access storage to make the required changes, reducing overall performance.
Hi @KshitijThareja, thank you for revisiting this. My apologies as well for the delayed response. I will take another look over your changes.
Hi @bjester. No worries. Do let me know if any changes are required in the PR 👍
@KshitijThareja I believe I see what you're saying. How are you feeling about this feature? Would you like to close it out? I think you've satisfied our requests and the feature's purpose.

I think an improvement would be separating this new behavior into a class or functions which can be unit tested. If it was extracted into a class, it could manage its state on its own and operate on the zipfile as an input. Unless I'm misunderstanding, a class could encapsulate the logic and make the current function clearer, but it could also defer most processing until it has found all references to images, and only then start resizing and reading from storage, thereby eliminating the situation you described.

Again, I think you've satisfied the feature, and from here we could focus on improving the code itself, mainly reducing the complexity of the function by splitting it into smaller bits. My only concern is proper test coverage, but if we intend further code improvements, it might be better to tackle those first.
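As a very rough illustration of that refactor idea, the sketch below collects image references first and only touches storage afterwards. All names are hypothetical, and the storage reader and zip writer are injected so the class can be unit tested with fakes instead of GCS and a real zipfile:

```python
class PerseusImageProcessor:
    """Two-phase sketch: record references while scanning content, then
    read each unique image once and write every needed entry to the zip."""

    def __init__(self, read_image, write_to_zip):
        self._read_image = read_image      # e.g. a GCS read in the real code
        self._write_to_zip = write_to_zip  # e.g. write_to_zipfile
        self._references = []              # (checksum, ext, width, height)

    def add_reference(self, checksum, ext, width, height):
        """Phase 1: record a reference; no storage access happens here."""
        self._references.append((checksum, ext, width, height))

    def flush(self):
        """Phase 2: perform the deferred reads/writes.

        Returns the number of storage reads, to make the dedup visible."""
        reads = 0
        raw = {}       # checksum -> bytes, so each image is read once
        written = set()
        for checksum, ext, width, height in self._references:
            name = "images/{}_{}x{}.{}".format(checksum, width, height, ext)
            if name in written:
                continue  # exact duplicate reference: nothing to do
            if checksum not in raw:
                raw[checksum] = self._read_image(checksum)
                reads += 1
            # Actual resizing of raw[checksum] to (width, height) would
            # happen here before writing; omitted in this sketch.
            self._write_to_zip(name, raw[checksum])
            written.add(name)
        return reads
```

Because all references are known before `flush`, this is also a natural place to merge rendered sizes that fall within the tolerance discussed earlier.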
Hi @bjester.
@KshitijThareja We can move forward with merging this first, if you'd like? That will also allow us to ensure that this new code is functional for various channels by testing it in our
Sure @bjester, that sounds good 👍. We can have followup issues for any additional work related to the resizing feature.
Summary
Description of the change(s) you made
This PR introduces image resizing functionality to resize assessment images that contain resizing data, ensuring that the final perseus zip contains resized images.
Manual verification steps performed
Reviewer guidance
How can a reviewer test these changes? (Set `CENTRAL_CONTENT_BASE_URL` to point to the URL of your local studio instance.)

References
Closes #4416
Comments
The approach used here slightly deviates from what was described in the issue. While working on this issue, I found that the control flow of the channel publish method never entered the image loop in the `create_perseus_zip` function, so the image resizing function couldn't be applied there. Instead, the image resizing now happens in the `process_image_strings` function, as this is where the image data is originally modified before being written to the assessment file, and it contains the image resizing data too.

Contributor's Checklist
PR process:

- Has the `CHANGELOG` label been added to this PR? Note: items with this label will be added to the CHANGELOG at a later time
- Has the `docs` label been added if this introduces a change that needs to be updated in the user docs?
- Are `requirements.txt` files also included in this PR?

Studio-specific:

- Has the `notranslate` class been added to elements that shouldn't be translated by Google Chrome's automatic translation feature (e.g. icons, user-generated text)?
- Are new components in the `pages`, `components`, and `layouts` directories as described in the docs?

Testing:
Reviewer's Checklist

This section is for reviewers to fill out.

- Have dependencies been installed (`yarn` and `pip`)?