Add image resizing functionality for perseus file exports #4786

Open
wants to merge 2 commits into
base: unstable

Conversation

KshitijThareja
Contributor

Summary

Description of the change(s) you made

This PR adds image resizing for assessment images that carry resizing data, and makes sure that the final perseus zip contains the resized images.

Manual verification steps performed

  1. Creating assessments containing varied image data, like original, resized and duplicate images.
  2. Publishing the channel and testing the changes on local Kolibri instance by importing it from the locally running studio server.

Reviewer guidance

How can a reviewer test these changes?

  1. Follow the verification steps listed under the previous heading.
  2. Create a dummy exercise and try adding images and resizing them in the exercise questions.
  3. Publish the channel and import it in Kolibri (use your local Studio for imports by setting the environment variable CENTRAL_CONTENT_BASE_URL to the URL of your local Studio instance).
  4. Check out the imported assessment to see whether the images have been resized accordingly.

References

Closes #4416

Comments

The approach used here deviates slightly from what was described in the issue. While working on this issue, I found that the control flow of the channel publish method never entered the image loop in the create_perseus_zip function, so the image resizing function couldn't be applied there. Instead, the image resizing now happens in the process_image_strings function, since this is where the image data is originally modified before being written to the assessment file, and it contains the image resizing data too.


Contributor's Checklist

PR process:

  • If this is an important user-facing change, the CHANGELOG label has been added to this PR or the related issue. Note: items with this label will be added to the CHANGELOG at a later time
  • If this includes an internal dependency change, a link to the diff is provided
  • The docs label has been added if this introduces a change that needs to be updated in the user docs
  • If any Python requirements have changed, the updated requirements.txt files are also included in this PR
  • Opportunities for using Google Analytics here are noted
  • Migrations are safe for a large db

Studio-specific:

  • All user-facing strings are translated properly
  • The notranslate class has been added to elements that shouldn't be translated by Google Chrome's automatic translation feature (e.g. icons, user-generated text)
  • All UI components are LTR and RTL compliant
  • Views are organized into pages, components, and layouts directories as described in the docs
  • Users' storage used is recalculated properly on any changes to main tree files
  • If there are new ways this uses user data that need to be factored into our Privacy Policy, they have been noted.

Testing:

  • Code is clean and well-commented
  • Contributor has fully tested the PR manually
  • If there are any front-end changes, before/after screenshots are included
  • Critical user journeys are covered by Gherkin stories
  • Any new interactions have been added to the QA Sheet
  • Critical and brittle code paths are covered by unit tests

Reviewer's Checklist

This section is for reviewers to fill out.

  • Automated test coverage is satisfactory
  • PR is fully functional
  • PR has been tested for accessibility regressions
  • External dependency files were updated if necessary (yarn and pip)
  • Documentation is updated
  • Contributor is in AUTHORS.md

Member

@bjester bjester left a comment

I haven't yet wrapped my head around the whole resize procedure itself. Although, I do see a situation where we should be cautious regarding reading files from storage.

I think you've shown good and valid instincts by attempting to consolidate code regarding the reading of images from storage, but I think performance considerations will need to take precedence since storage represents Google Cloud Storage (GCS). There'll be significant latency involved in any operation with GCS (compared with not accessing it at all), including reading an image file from storage.

Since the location of this code exists within the channel publishing procedure, which is an area of growing need for optimization, we should be cautious with anything that may add to its duration. A channel could have many exercises, and within each exercise, there could be many questions and answers, either of which could hold images. So during the publishing process, we should avoid accessing them in storage until absolutely necessary.

At a higher level, it seems there exists at least one pathway where original_content could go unused, meaning we could be greatly increasing the duration of the publishing procedure when it was unnecessary. I'd be interested in how far we can go with avoiding storage access within the resizing procedure itself, but like I said, I haven't wrapped my head around it all. I also realize that there are more challenges with regard to the resizing procedure than the high-level path I noted.

@@ -708,18 +728,40 @@ def process_image_strings(content, zf, channel_id):
logging.warning("NOTE: the following error would have been swallowed silently in production")
raise

image_name = "images/{}.{}".format(checksum, ext[1:])
if image_name not in zf.namelist():
with storage.open(ccmodels.generate_object_storage_name(checksum, filename), 'rb') as imgfile:
Member

Prior to the changes here, it looks like reading from storage only occurred within the above if condition, which seems to test whether the file exists in the perseus zip.

image_list.append(image_data)
content = content.replace(match.group(1), img_match.group(1))
original_image_name = "images/{}.{}".format(checksum, ext[1:])
with storage.open(ccmodels.generate_object_storage_name(checksum, filename), 'rb') as imgfile:
Member

After your changes here, it seems we'll always read the image from storage, for every match in the for-loop that contains this.

write_to_zipfile(original_image_name, original_content, zf)
content = content.replace(match.group(1), img_match.group(1))
else:
if original_image_name not in zf.namelist():
Member

Seems in this high level path, if it results in False, then original_content goes unused?

@KshitijThareja
Contributor Author

Hi @bjester!
I read through all your comments, and yeah, there's certainly room for improvement and the issue with accessing the storage for every image can definitely be fixed. This is the first iteration of the working resizing mechanism, which required me to try out a lot of different approaches. I'll work on the potential issues and update the PR in some time.
Thanks for your inputs 😄

@KshitijThareja
Contributor Author

KshitijThareja commented Oct 22, 2024

Hi @bjester!
I was trying to find a way to optimize the resizing process. But it seems like original_content won't go unused in any case, as all the images will definitely go through one of the two conditionals, and the check most probably won't result in False. None of the images would be present in the zipfile initially, and they will only be added to it from one of the conditional statements.
I am a little unsure of how to proceed. Do you have any suggestions as to what should be done in this case?

@rtibbles rtibbles added this to the Studio: Q4 patch release 1 milestone Oct 22, 2024
@bjester
Member

bjester commented Oct 23, 2024

@KshitijThareja upon further review, seems I didn't have my head around this code entirely. I think the only area where we might save time would be when the same image is used multiple times in the exercise. I think the existing check does that because the checksum would be the same and therefore the file would already exist in the zip. I think you can mimic that for the resizing, producing a zero net effect on the duration, by keeping a mapping of the original to resized name. That way, we only ever read the same image once. You might have to also keep track of the width/height along with that, in case the same image is used in multiple sizes. Although, perhaps if that's the case, we could keep only one size of the image?
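
The mapping idea above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code in this PR: `add_images`, `read_original`, and the `images/<checksum>_<width>x<height>.<ext>` naming are hypothetical, and the actual resize step is left as a placeholder.

```python
import io
import zipfile


def add_images(zf, read_original, references):
    """Write one zip entry per (checksum, width, height); read each original once.

    `read_original(checksum, ext)` is a hypothetical stand-in for the storage
    read, and `references` is an iterable of (checksum, ext, width, height).
    """
    originals = {}      # checksum -> original bytes already fetched
    resized_names = {}  # (checksum, width, height) -> name inside the zip
    for checksum, ext, width, height in references:
        key = (checksum, width, height)
        if key in resized_names:
            continue  # same image at the same size: nothing to read or write
        if checksum not in originals:
            originals[checksum] = read_original(checksum, ext)
        # Hypothetical naming scheme for resized copies.
        name = "images/{}_{}x{}.{}".format(checksum, width, height, ext)
        # Placeholder: a real implementation would resize the bytes here.
        zf.writestr(name, originals[checksum])
        resized_names[key] = name
    return resized_names
```

Keeping the original bytes in memory trades storage reads for memory use, so a real implementation may want to drop them once an exercise has been processed.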

@KshitijThareja
Contributor Author

@bjester So what you are suggesting here is that if an image is repeated, we keep the original one (with or without resizing). If that's the case, and if this behavior is desirable, we can surely do that. That could help save time, although only if we have repeated image uploads, as you mentioned.

@bjester
Member

bjester commented Oct 25, 2024

@KshitijThareja no, not exactly. Prior to the image resizing, if an image was repeated, that means its checksum should repeat. If that makes sense, then the check if original_image_name not in zf.namelist() would have skipped repeats because a file with the same name already exists. That then saves unnecessary reads from GCS.

What I'm suggesting is that we preserve this handling, because in such a situation, the performance cost should be no different than it was before. It seems this publishing code always processes images if the exercise has changed, which was my mistake in my initial review. So that means if we prevent reprocessing of duplicate images (specifically in regards to how many reads from GCS) then the overall performance should be similar, ignoring for the moment the cost of resizing because that's the desired feature.

It gets a little complicated, because the same image could appear with multiple rendered sizes. Although, there could be benefit to handling this sort of scenario. For example, say a particular image with a native size of 1024x1024 is used 2 times. Let's pretend the rendered sizes of the image are 512x512 and 568x568. If we go through each instance, my understanding of your work thus far is that it should produce 2 images in the zip file, one for each size. Resizing the images saves disk space, but because the image is used multiple times, each duplicate would start to reduce the effectiveness of the process with regard to disk space. The resizing process itself satisfies the original issue, where images are used in answers. But in the case of the same image used multiple times, it reduces the effectiveness of the resizing process at saving space in the zip:

we would be shipping a much larger image than is necessary in the bundled perseus exercise.
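
As a rough worked example of the numbers above, the pixel counts can be compared directly. Pixel count is only a proxy here, assumed for illustration; the actual file sizes depend on the image format and compression.

```python
def pixels(width, height):
    """Number of pixels in an image of the given dimensions."""
    return width * height


native = pixels(1024, 1024)                     # 1,048,576 pixels
resized = [pixels(512, 512), pixels(568, 568)]  # 262,144 and 322,624 pixels
total_resized = sum(resized)                    # 584,768 pixels
ratio = total_resized / native                  # roughly 0.56 of the native size
```

With two rendered sizes, the resized copies together already hold a bit more than half the pixels of the single native image, and every further distinct size adds to that total.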

So what I propose is that we handle 2 things at once by focusing on handling duplicates. First, by handling duplicates, we can prevent repeat reads from GCS, meaning the prior behavior is preserved. Second, we could institute logic that determines when we could use just a single resized copy of the image. For instance, if two rendered sizes are within 1% of each other, it seems we could save space by just using one resized image (perhaps the smaller) and have the other usage use the same image, instead of 2 resized versions of the same image.

*The 1% leeway in sizes was just an example; I'm not sure what the sweet spot is for that.
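
One way the tolerance check could look is sketched below. The function names, the shape of the `resized` mapping, and the choice to measure the difference relative to the larger dimension are all assumptions made for illustration; the 1% default simply mirrors the example figure above.

```python
def within_tolerance(a, b, tolerance=0.01):
    """True when a and b differ by at most `tolerance` relative to the larger."""
    return abs(a - b) <= tolerance * max(a, b)


def find_similar(resized, checksum, width, height, tolerance=0.01):
    """Return the zip name of an already-resized copy close enough to reuse.

    `resized` maps checksum -> {(width, height): name inside the zip}.
    Returns None when no existing copy is within the tolerance.
    """
    for (w, h), name in resized.get(checksum, {}).items():
        if within_tolerance(w, width, tolerance) and within_tolerance(h, height, tolerance):
            return name
    return None
```

For the sizes in the example, 512x512 and 568x568 differ by about 10%, so they would not be merged under a 1% tolerance, whereas 512x512 and 515x515 would. Note that this sketch returns whichever copy was registered first, which is not necessarily the smaller one.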

@bjester bjester removed this from the Studio: Q4 patch release 1 milestone Oct 25, 2024
@KshitijThareja
Contributor Author

Hi @bjester!
I'm sorry this issue took so long. I've made the required changes as suggested. However, from your review here:

we could save space by just using one resized image (perhaps the smaller) and have the other usage use the same image, instead of 2 resized versions of the same image

I've updated the code to reuse a similar resized image (within 1% of the size of the image currently being resized) from the resized_images_map. But if the image currently being processed is smaller than the similar image in the resized_images_map, the code still picks the image from the map (which is bigger by up to 1%) and not the current one (smaller by up to 1%).

This is because if we decided to go with the current one and replace the existing one (from the map) with it, that would complicate the whole process: the existing image would already have been written to the zipfile, and we'd have to access storage again to make the required changes, reducing overall performance.

@bjester
Member

bjester commented Feb 11, 2025

Hi @KshitijThareja,

Thank you for revisiting this. My apologies as well for the delayed response. I will take another look over your changes.

@KshitijThareja
Contributor Author

Hi @bjester. No worries. Do let me know if any changes are required in the PR 👍

@bjester
Member

bjester commented Feb 18, 2025

@KshitijThareja I believe I see what you're saying. How are you feeling about this feature? Would you like to close it out? I think you've satisfied our requests and the feature's purpose. I think an improvement would be separating this new behavior into a class or functions which can be unit tested. If it was extracted into a class, then it could manage its state on its own and operate on the zipfile as an input. Unless I'm misunderstanding, I think a class could encapsulate the logic and make the current function clearer, but could also defer most processing until it has found all references to images, and only then start resizing and reading from storage thereby eliminating what you described?

Again, I think you've satisfied the feature, and from here, we could focus on improving code itself, mainly reducing the complexity of the function by splitting it into smaller bits. My only concern is proper test coverage, but if we intend further code improvements, it might be better to tackle that first.
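
The class-based approach suggested above might look something like the following sketch. Everything here is hypothetical: the class name, the injected `read_original` and `resize` callables, the naming scheme, and the 1% default tolerance are assumptions for illustration rather than the PR's implementation.

```python
import io
import zipfile
from collections import defaultdict


class DeferredImageResizer:
    """Hypothetical helper: collect image references first, touch storage later."""

    def __init__(self, read_original, resize, tolerance=0.01):
        self.read_original = read_original  # callable(checksum, ext) -> bytes
        self.resize = resize                # callable(data, width, height) -> bytes
        self.tolerance = tolerance
        self.references = defaultdict(set)  # (checksum, ext) -> {(width, height)}

    def add_reference(self, checksum, ext, width, height):
        """Record a usage; no storage access or resizing happens here."""
        self.references[(checksum, ext)].add((width, height))

    def _close(self, a, b):
        return abs(a - b) <= self.tolerance * max(a, b)

    @staticmethod
    def _name(checksum, ext, size):
        # Hypothetical naming scheme for resized copies.
        return "images/{}_{}x{}.{}".format(checksum, size[0], size[1], ext)

    def write_all(self, zf):
        """Resize and write every needed copy; return (checksum, w, h) -> zip name."""
        names = {}
        for (checksum, ext), sizes in self.references.items():
            data = self.read_original(checksum, ext)  # one read per original
            kept = []
            # Ascending order, so the earlier (generally smaller) size of a
            # close pair becomes the copy that is written.
            for width, height in sorted(sizes):
                match = next(
                    (k for k in kept
                     if self._close(k[0], width) and self._close(k[1], height)),
                    None,
                )
                if match is None:
                    match = (width, height)
                    kept.append(match)
                    zf.writestr(self._name(checksum, ext, match),
                                self.resize(data, width, height))
                names[(checksum, width, height)] = self._name(checksum, ext, match)
        return names
```

Because all sizes are known before anything is written, the smaller of two near-identical sizes can be chosen without rewriting a zip entry or reading from storage a second time, and the class can be unit tested with fake `read_original` and `resize` callables.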

@KshitijThareja
Contributor Author

Hi @bjester.
I can surely give it a try and see if we can achieve that. But it'll take me some time to get the changes done. Hope that's okay.
Also, I guess I could work on writing unit tests for the resizing images functionality once everything is done and include them in this PR itself.

@bjester
Member

bjester commented Feb 24, 2025

@KshitijThareja We can move forward with merging this first, if you'd like? That will also allow us to ensure that this new code is functional for various channels by testing it in our unstable server environment, which may be helpful while the code is consolidated in a followup task.

@KshitijThareja
Contributor Author

Sure @bjester, that sounds good 👍. We can have followup issues for any additional work related to the resizing feature.

Development

Successfully merging this pull request may close these issues.

Resize image files on perseus file export
4 participants