Add image resizing functionality for perseus file exports #4786

KshitijThareja · 2024-10-13T09:15:18Z

Summary

Description of the change(s) you made

This PR adds image resizing for assessment images that carry resizing data, so that the final perseus zip contains the resized images.

Manual verification steps performed

Created assessments containing varied image data: original, resized, and duplicate images.
Published the channel and tested the changes on a local Kolibri instance by importing the channel from the locally running Studio server.

Reviewer guidance

How can a reviewer test these changes?

Follow the verification steps listed under the previous heading.
Create a dummy exercise and try adding images and resizing them in the exercise questions.
Publish the channel and import it in Kolibri (use the local Studio for imports by changing the environment variable CENTRAL_CONTENT_BASE_URL to point to the URL of your local Studio instance).
Check out the imported assessment to see whether the images have been resized accordingly.

References

Closes #4416

Comments

The approach used here deviates slightly from what the issue description proposed. While working on this issue, I found that the control flow of the channel publish method never enters the image loop in the create_perseus_zip function, so the image resizing function couldn't be applied there. Instead, the resizing now happens in the process_image_strings function, since this is where the image data is originally modified before being written to the assessment file, and it has access to the image resizing data too.

Contributor's Checklist

PR process:

If this is an important user-facing change, the CHANGELOG label has been added to this PR or the related issue. Note: items with this label will be added to the CHANGELOG at a later time
If this includes an internal dependency change, a link to the diff is provided
The docs label has been added if this introduces a change that needs to be updated in the user docs
If any Python requirements have changed, the updated requirements.txt files are also included in this PR
Opportunities for using Google Analytics here are noted
Migrations are safe for a large db

Studio-specific:

All user-facing strings are translated properly
The notranslate class has been added to elements that shouldn't be translated by Google Chrome's automatic translation feature (e.g. icons, user-generated text)
All UI components are LTR and RTL compliant
Views are organized into pages, components, and layouts directories as described in the docs
Users' storage used is recalculated properly on any changes to main tree files
If there are new ways this uses user data that need to be factored into our Privacy Policy, it has been noted.

Testing:

Code is clean and well-commented
Contributor has fully tested the PR manually
If there are any front-end changes, before/after screenshots are included
Critical user journeys are covered by Gherkin stories
Any new interactions have been added to the QA Sheet
Critical and brittle code paths are covered by unit tests

Reviewer's Checklist

This section is for reviewers to fill out.

Automated test coverage is satisfactory
PR is fully functional
PR has been tested for accessibility regressions
External dependency files were updated if necessary (yarn and pip)
Documentation is updated
Contributor is in AUTHORS.md

bjester

I haven't yet wrapped my head around the whole resize procedure itself. However, I do see a situation where we should be cautious regarding reading files from storage.

I think you've shown good and valid instincts by attempting to consolidate the code that reads images from storage, but I think performance considerations will need to take precedence since storage represents Google Cloud Storage (GCS). There will be significant latency in any operation with GCS (compared with not accessing it at all), including reading an image file from storage.

Since the location of this code exists within the channel publishing procedure, which is an area of growing need for optimization, we should be cautious with anything that may add to its duration. A channel could have many exercises, and within each exercise, there could be many questions and answers, either of which could hold images. So during the publishing process, we should avoid accessing them in storage until absolutely necessary.

At a higher level, it seems there exists at least one pathway where original_content could go unused, meaning we could be greatly increasing the duration of the publishing procedure when it was unnecessary. I'd be interested in how far we can go with avoiding storage access within the resizing procedure itself, but like I said, I haven't wrapped my head around it all. I also realize that there are more challenges with the resizing procedure than the high-level path I noted.

bjester · 2024-10-15T20:25:05Z

contentcuration/contentcuration/utils/publish.py

@@ -708,18 +728,40 @@ def process_image_strings(content, zf, channel_id):
                    logging.warning("NOTE: the following error would have been swallowed silently in production")
                    raise

-            image_name = "images/{}.{}".format(checksum, ext[1:])
-            if image_name not in zf.namelist():
-                with storage.open(ccmodels.generate_object_storage_name(checksum, filename), 'rb') as imgfile:


Prior to the changes here, it looks like reading from storage only occurred within the above if condition, which seems to test whether the file exists in the perseus zip.

bjester · 2024-10-15T20:27:19Z

contentcuration/contentcuration/utils/publish.py

-                image_list.append(image_data)
-            content = content.replace(match.group(1), img_match.group(1))
+            original_image_name = "images/{}.{}".format(checksum, ext[1:])
+            with storage.open(ccmodels.generate_object_storage_name(checksum, filename), 'rb') as imgfile:


After your changes here, it seems we'll always read the image from storage, for every match in the for-loop that contains this.

bjester · 2024-10-15T20:44:09Z

contentcuration/contentcuration/utils/publish.py

+                            write_to_zipfile(original_image_name, original_content, zf)
+                        content = content.replace(match.group(1), img_match.group(1))
+                else:
+                    if original_image_name not in zf.namelist():


Seems in this high level path, if it results in False, then original_content goes unused?

KshitijThareja · 2024-10-16T07:43:32Z

Hi @bjester!
I read through all your comments, and yeah, there's certainly room for improvement and the issue with accessing the storage for every image can definitely be fixed. This is the first iteration of the working resizing mechanism, which required me to try out a lot of different approaches. I'll work on the potential issues and update the PR in some time.
Thanks for your inputs 😄

KshitijThareja · 2024-10-22T09:49:49Z

Hi @bjester!
I was trying to find a way to optimize the resizing process, but it seems that original_content won't go unused in any case: every image will go through one of the two conditionals, and the check most probably won't result in False. None of the images would be present in the zipfile initially, and they will only be added to it from one of the conditional statements.
I am a little unsure of how to proceed. Do you have any suggestions as to what should be done in this case?

bjester · 2024-10-23T21:58:31Z

@KshitijThareja upon further review, seems I didn't have my head around this code entirely. I think the only area where we might save time would be when the same image is used multiple times in the exercise. I think the existing check does that because the checksum would be the same and therefore the file would already exist in the zip. I think you can mimic that for the resizing, producing a zero net effect on the duration, by keeping a mapping of the original to resized name. That way, we only ever read the same image once. You might have to also keep track of the width/height along with that, in case the same image is used in multiple sizes. Although, perhaps if that's the case, we could keep only one size of the image?
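The mapping described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the code in this PR: `ResizedNameCache`, `read_original`, and the `images/<checksum>_<width>x<height>.<ext>` naming scheme are all hypothetical. The cache keys resized names by checksum plus rendered size, and reads each original from storage at most once.

```python
from typing import Callable, Dict, Tuple

# Hypothetical key: (checksum, width, height) identifies one rendered variant.
ResizeKey = Tuple[str, int, int]


class ResizedNameCache:
    """Maps an original image plus a rendered size to a resized file name.

    `read_original` is a stand-in for the real storage read (an assumption).
    """

    def __init__(self, read_original: Callable[[str], bytes]) -> None:
        self._read_original = read_original
        self._originals: Dict[str, bytes] = {}
        self._names: Dict[ResizeKey, str] = {}

    def original(self, checksum: str) -> bytes:
        # Read from storage only the first time a checksum is seen.
        if checksum not in self._originals:
            self._originals[checksum] = self._read_original(checksum)
        return self._originals[checksum]

    def resized_name(self, checksum: str, ext: str, width: int, height: int) -> str:
        key = (checksum, width, height)
        if key not in self._names:
            self.original(checksum)  # make sure the bytes are loaded once
            # Hypothetical naming scheme, chosen only for this sketch.
            self._names[key] = "images/{}_{}x{}.{}".format(checksum, width, height, ext)
        return self._names[key]
```

Keeping the original bytes in memory trades memory for fewer storage reads; whether that trade is acceptable for large exercises is a separate judgment call.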

KshitijThareja · 2024-10-25T18:11:28Z

@bjester So what you are suggesting here is that if an image is repeated, we keep the original one (with or without resizing). If that's the case, and if this behavior is desirable, we can surely do that. It could help save time, although only when there are repeated image uploads, as you mentioned.

bjester · 2024-10-25T18:44:46Z

@KshitijThareja no not exactly. Prior to the image resizing, if an image was repeated, that means its checksum should repeat. If that makes sense, then the check if original_image_name not in zf.namelist() would have skipped repeats because a file with the same name already exists. That then saves unnecessary reads from GCS.

What I'm suggesting is that we preserve this handling, because in such a situation, the performance cost should be no different than it was before. It seems this publishing code always processes images if the exercise has changed, which was my mistake in my initial review. So that means if we prevent reprocessing of duplicate images (specifically in regards to how many reads from GCS) then the overall performance should be similar, ignoring for the moment the cost of resizing because that's the desired feature.
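The duplicate handling described here can be illustrated with a small standalone sketch. It assumes a hypothetical `read_from_storage` callable in place of the real storage read; only the `images/<checksum>.<ext>` name format and the `zf.namelist()` check are taken from the diff above.

```python
import io
import zipfile


def add_image_once(zf, checksum, ext, read_from_storage):
    """Write the image into the zip unless an entry with that name exists.

    Returns True when storage was read, False when the read was skipped.
    `read_from_storage` is a hypothetical stand-in for the storage access.
    """
    image_name = "images/{}.{}".format(checksum, ext)
    if image_name in zf.namelist():
        # Same checksum means same name, so the repeat costs no storage read.
        return False
    zf.writestr(image_name, read_from_storage(checksum))
    return True


def demo():
    reads = []

    def fake_storage(checksum):
        reads.append(checksum)
        return b"image-bytes"

    with zipfile.ZipFile(io.BytesIO(), "w") as zf:
        results = [
            add_image_once(zf, "abc", "png", fake_storage),
            add_image_once(zf, "abc", "png", fake_storage),  # duplicate
            add_image_once(zf, "def", "png", fake_storage),
        ]
    return results, reads
```

In this sketch the second call for `abc` returns without touching storage, which is the behavior the comment asks to preserve once resizing is added.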

It gets a little complicated, because the same image could appear with multiple rendered sizes. Still, there could be a benefit to handling this sort of scenario. For example, say a particular image with a native size of 1024x1024 is used 2 times. Let's pretend the rendered sizes of the image are 512x512 and 568x568. If we go through each instance, my understanding of your work thus far is that it should produce 2 images in the zip file, one for each size. Resizing the images saves disk space, but when the image is used multiple times, each additional resized copy reduces how much space the process saves. The resizing process itself satisfies the original issue, where images are used in answers. But when the same image is used multiple times, it becomes less effective at addressing the problem stated in the issue:

we would be shipping a much larger image than is necessary in the bundled perseus exercise.

So what I propose is that we handle 2 things at once by focusing on duplicates. First, by handling duplicates, we can prevent repeat reads from GCS, meaning the prior behavior is preserved. Second, we could add logic that determines when a single resized copy of the image is enough. For instance, if two rendered sizes are within 1% of each other, it seems we could save space by using one resized image (perhaps the smaller) for both usages, instead of 2 resized versions of the same image.

*the 1% leeway in sizes was just for example-- I'm not sure what the sweet spot is for that
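A minimal sketch of the tolerance check might look like the following. The helper names `is_similar` and `find_similar_size` are hypothetical, the 1% default is only the example figure mentioned above, and measuring the difference relative to the larger dimension is an assumption made for this sketch.

```python
def is_similar(a, b, tolerance=0.01):
    """True when two dimensions differ by at most `tolerance` of the larger one."""
    return abs(a - b) <= tolerance * max(a, b)


def find_similar_size(existing_sizes, width, height, tolerance=0.01):
    """Return the first already-resized (width, height) close enough to reuse.

    `existing_sizes` is an iterable of (width, height) tuples recorded for one
    image. Returns None when no recorded size is within tolerance.
    """
    for existing_width, existing_height in existing_sizes:
        if is_similar(existing_width, width, tolerance) and is_similar(
            existing_height, height, tolerance
        ):
            return (existing_width, existing_height)
    return None
```

With the example sizes from this comment, 512x512 and 568x568 differ by about 10% and would not be merged, while 512x512 and 515x514 would share one resized file. Note that a first-match lookup like this returns whichever size was recorded first, which is not necessarily the smaller one.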

KshitijThareja · 2025-01-29T11:54:41Z

Hi @bjester!
I'm sorry this issue took so long. I've made the required changes as suggested. However, from your review here:

we could save space by just using one resized image (perhaps the smaller) and have the other usage use the same image, instead of 2 resized versions of the same image

I've updated the code to reuse a similar resized image (within 1% of the size of the image currently being resized) from the resized_images_map. But if the image currently being processed is the smaller one compared to the similar image in the resized_images_map, the current code still picks the image from the map (which is up to 1% bigger) and not the current one (which is up to 1% smaller).

This is because replacing the existing image (from the map) with the current one would complicate the whole process: the existing image would already have been written to the zipfile, and we'd have to access storage again to make the required changes, reducing overall performance.

bjester · 2025-02-11T15:55:23Z

Hi @KshitijThareja,

Thank you for revisiting this. My apologies as well for the delayed response. I will take another look over your changes.

KshitijThareja · 2025-02-13T18:51:55Z

Hi @bjester. No worries. Do let me know if any changes are required in the PR 👍

bjester · 2025-02-18T22:49:01Z

@KshitijThareja I believe I see what you're saying. How are you feeling about this feature? Would you like to close it out? I think you've satisfied our requests and the feature's purpose. I think an improvement would be separating this new behavior into a class or functions which can be unit tested. If it were extracted into a class, it could manage its own state and operate on the zipfile as an input. Unless I'm misunderstanding, I think a class could encapsulate the logic and make the current function clearer, and it could also defer most processing until it has found all references to images, only then starting to read from storage and resize, thereby eliminating the limitation you described?
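One way such a class could look is sketched below. Everything here is hypothetical and not taken from the PR: the class name, the injected `read_original` and `resize` callables, and the `images/<checksum>_<width>x<height>.<ext>` naming scheme are assumptions made for illustration. The point is only the two-phase shape: collect references first, then read each original once and write all of its sizes.

```python
import io
import zipfile
from collections import defaultdict


class DeferredImageResizer:
    """Collects image references first and touches storage only in write_all."""

    def __init__(self, read_original, resize):
        self._read_original = read_original  # callable(checksum) -> bytes
        self._resize = resize  # callable(data, width, height) -> bytes
        self._sizes = defaultdict(set)  # checksum -> {(width, height), ...}
        self._exts = {}  # checksum -> file extension

    def add_reference(self, checksum, ext, width, height):
        # No storage access here; only record what was referenced.
        self._exts[checksum] = ext
        self._sizes[checksum].add((width, height))

    def write_all(self, zf):
        written = {}
        for checksum, sizes in self._sizes.items():
            original = self._read_original(checksum)  # one read per unique image
            for width, height in sorted(sizes):
                name = "images/{}_{}x{}.{}".format(
                    checksum, width, height, self._exts[checksum]
                )
                zf.writestr(name, self._resize(original, width, height))
                written[(checksum, width, height)] = name
        return written


def demo():
    reads = []

    def fake_read(checksum):
        reads.append(checksum)
        return b"orig"

    def fake_resize(data, width, height):
        return data + "{}x{}".format(width, height).encode()

    resizer = DeferredImageResizer(fake_read, fake_resize)
    resizer.add_reference("abc", "png", 512, 512)
    resizer.add_reference("abc", "png", 568, 568)
    resizer.add_reference("abc", "png", 512, 512)  # duplicate reference
    with zipfile.ZipFile(io.BytesIO(), "w") as zf:
        resizer.write_all(zf)
        names = zf.namelist()
    return names, reads
```

Because every rendered size is known before anything is written, a step that merges near-identical sizes and keeps the smaller one could run at the start of `write_all` without any extra storage access. Injecting the storage and resize callables also keeps the class unit-testable without GCS.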

Again, I think you've satisfied the feature, and from here, we could focus on improving the code itself, mainly reducing the complexity of the function by splitting it into smaller pieces. My only concern is proper test coverage, but if we intend further code improvements, it might be better to tackle those first.

KshitijThareja · 2025-02-22T02:39:55Z

Hi @bjester.
I can surely give it a try and see if we can achieve that, but it'll take me some time to get the changes done. I hope that's okay.
Also, I could work on writing unit tests for the image resizing functionality once everything is done and include them in this PR itself.

bjester · 2025-02-24T16:01:22Z

@KshitijThareja We can move forward with merging this first, if you'd like? That will also allow us to ensure that this new code is functional for various channels by testing it in our unstable server environment, which may be helpful while the code is consolidated in a followup task.

KshitijThareja · 2025-02-25T07:57:29Z

Sure @bjester, that sounds good 👍. We can have followup issues for any additional work related to the resizing feature.

Introduce image resizing functionality for processing images

46a15ed

MisRob requested review from jredrejo, rtibbles and bjester October 14, 2024 08:45

MisRob added the TODO: needs review label Oct 14, 2024

bjester requested changes Oct 15, 2024

rtibbles added this to the Studio: Q4 patch release 1 milestone Oct 22, 2024

bjester removed this from the Studio: Q4 patch release 1 milestone Oct 25, 2024

rtibbles assigned bjester Nov 8, 2024

Optimize the resiszing feature to handle duplicate or similar images

88652f4
