
Fix upload of zarr datasets with already existing datasource-properties.json file #8268

Merged
MichaelBuessemeyer merged 3 commits into master from fix-explore-zarr-bug on Dec 10, 2024

Conversation

@MichaelBuessemeyer (Contributor) commented Dec 9, 2024

This PR fixes the upload of zarr datasets that already have a datasource-properties.json file present. The bug originates from (I'd say) wrong prioritization when finishing a dataset upload: first, the backend checked whether the uploaded dataset is in zarr format, and only if that failed did it check whether a datasource-properties.json is present (meaning the dataset is already ready to be used). This PR moves the check for a datasource-properties.json to the first position. Thus, a dataset with an already existing datasource-properties.json is simply treated as already explored, and no additional effort is spent trying to generate a datasource-properties.json file for it.
Previously, the backend would detect such a dataset as zarr and assume it needed a datasource-properties.json whose contents had to be guessed. Since this is currently not supported by the zarr after-upload exploration code, the uploads failed.
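
Concretely, the reordered checks look roughly like this (a sketch reconstructed from the reviewed diff below; the final fallback branch is a placeholder, since the remaining format checks are not shown in this PR):

```scala
private def guessTypeOfUploadedDataSource(dataSourceDir: Path): UploadedDataSourceType.Value =
  if (looksLikeExploredDataSource(dataSourceDir).openOr(false)) {
    // A datasource-properties.json is already present: treat the dataset as
    // explored and skip any attempt to guess its contents.
    UploadedDataSourceType.EXPLORED
  } else if (looksLikeZarrArray(dataSourceDir, maxDepth = 2).openOr(false)) {
    // Only datasets without a properties file go through zarr exploration.
    UploadedDataSourceType.ZARR
  } else {
    // ... further format checks follow here (placeholder; not part of this diff)
    ???
  }
```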

I hope this is the correct way of fixing this. An alternative (which I tried first, before locating the wrong ordering of determining the dataset format) was to fix the zarr upload exploration. But this created strange datasets: all mags were interpreted as layers, which led to a wrong datasource-properties.json. Moreover: !! @fm3 the server crashed with a heap overflow when trying to view the dataset. IMO this should never happen just because some datasource-properties.json is misconfigured.

URL of deployed dev instance (used for testing):

  • https://___.webknossos.xyz

Steps to test:

  • prepare a single-layered zarr dataset with a datasource-properties.json
  • zip it (see the sketch below)
  • upload it via the UI
  • the upload should succeed, and so should viewing the data
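
For the zipping step, something like the following should work (the directory name is illustrative; the datasource-properties.json must end up inside the archive alongside the zarr data):

```bash
# Zip the dataset directory so that datasource-properties.json and the
# zarr arrays are contained in the archive (name is just an example).
zip -r my_zarr_dataset.zip my_zarr_dataset/
```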

Issues:



coderabbitai bot commented Dec 9, 2024

📝 Walkthrough

The pull request introduces several updates to the WEBKNOSSOS project, including the addition of a tooltip displaying the total volume of datasets, renaming capabilities for datasets, and improved default colors for skeleton trees. The terminology "resolution" has been replaced with "magnification." Enhancements to AI model training accommodate differently sized bounding boxes. Multiple bug fixes address issues with dataset listing, uploads, annotation layer name restrictions, and synchronization of dataset settings. Additionally, support for HTTP API versions 3 and 4 has been removed, affecting backward compatibility.

Changes

| File Path | Change Summary |
| --- | --- |
| CHANGELOG.unreleased.md | Updated to document new features, bug fixes, and terminology changes, including the tooltip for dataset volume, renaming datasets, and removal of HTTP API versions 3 and 4. |
| webknossos-datastore/app/com/scalableminds/webknossos/datastore/services/uploading/UploadService.scala | Modified the guessTypeOfUploadedDataSource method to prioritize checks for explored data sources over zarr arrays; no changes to public entity declarations. |

Possibly related PRs

Poem

🐰 In the garden of code, changes bloom bright,
With tooltips and colors, a beautiful sight.
Datasets can dance with new names they wear,
While bugs hop away, no longer a scare.
So let’s celebrate with a joyful cheer,
For WEBKNOSSOS shines, bringing us near! 🌼


📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 14737b3 and e29abcc.

📒 Files selected for processing (2)
  • CHANGELOG.unreleased.md (1 hunks)
  • webknossos-datastore/app/com/scalableminds/webknossos/datastore/services/uploading/UploadService.scala (1 hunks)
🔇 Additional comments (2)
CHANGELOG.unreleased.md (1)

24-24: LGTM! Clear and accurate changelog entry.

The changelog entry accurately describes the fix for uploading zarr datasets with existing datasource-properties.json files.
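
For reference, the entry is presumably along these lines (reconstructed from this description, not quoted verbatim from CHANGELOG.unreleased.md):

```
- Fixed the upload of zarr datasets that already contain a datasource-properties.json file. [#8268](https://github.com/scalableminds/webknossos/pull/8268)
```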

webknossos-datastore/app/com/scalableminds/webknossos/datastore/services/uploading/UploadService.scala (1)

469-472: LGTM! Correct prioritization of data source type checks.

The reordering of checks to prioritize datasource-properties.json verification before zarr array detection is the right fix. This prevents the backend from attempting to guess the contents when the file already exists.

Let's verify that this is the only location where we determine the data source type:

✅ Verification successful

LGTM! The data source type determination is correctly isolated.

The search results confirm that:

  • The data source type determination logic is only implemented in UploadService.scala
  • The looksLikeExploredDataSource and looksLikeZarrArray checks are only used for type determination in this single location
  • Other usages of FILENAME_DATASOURCE_PROPERTIES_JSON are for reading/writing properties, not for type determination
  • No other implementations of guessTypeOfUploadedDataSource were found
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

```bash
#!/bin/bash
# Description: Check for other instances of data source type determination
# Expect: Only this implementation should handle the logic

# Search for similar data source type determination logic
rg -A 5 "looksLikeExploredDataSource|looksLikeZarrArray|FILENAME_DATASOURCE_PROPERTIES_JSON"

# Search for other implementations of guessTypeOfUploadedDataSource
ast-grep --pattern 'def guessTypeOfUploadedDataSource'
```

Length of output: 12855


@coderabbitai (bot) left a comment

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (1)
webknossos-datastore/app/com/scalableminds/webknossos/datastore/services/uploading/UploadService.scala (1)

469-471: Consider adding debug logging for dataset type detection.

To improve debuggability, consider adding debug-level logging that shows which conditions were checked and why a dataset was classified as a particular type. This would help diagnose similar issues in the future.

Example implementation:

```diff
 private def guessTypeOfUploadedDataSource(dataSourceDir: Path): UploadedDataSourceType.Value =
   if (looksLikeExploredDataSource(dataSourceDir).openOr(false)) {
+    logger.debug(s"Dataset at $dataSourceDir identified as already explored")
     UploadedDataSourceType.EXPLORED
   } else if (looksLikeZarrArray(dataSourceDir, maxDepth = 2).openOr(false)) {
+    logger.debug(s"Dataset at $dataSourceDir identified as zarr array")
     UploadedDataSourceType.ZARR
```
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 14737b3 and f9ac62b.

📒 Files selected for processing (1)
  • webknossos-datastore/app/com/scalableminds/webknossos/datastore/services/uploading/UploadService.scala (1 hunks)
🔇 Additional comments (1)
webknossos-datastore/app/com/scalableminds/webknossos/datastore/services/uploading/UploadService.scala (1)

469-471: LGTM! The fix correctly prioritizes checking for explored datasets.

The reordering of checks ensures that datasets with existing datasource-properties.json are properly identified as already explored, preventing unnecessary re-exploration attempts. This fixes the issue with zarr dataset uploads.

Let's verify the fix handles the described scenario correctly:

✅ Verification successful

Let me generate additional verification scripts to check the implementation details of these methods and their usage:

This will help us verify:

  1. The actual implementation of the file checking logic
  2. The constant definitions to ensure correct file names are being checked
  3. The existence of test cases validating this behavior

Let me try one more verification to check the implementation details:

This will help us verify:

  1. The constant definitions used for file checks
  2. The implementation of the file matching logic
  3. Any related test cases in the test directory

Based on the verification results, I can now generate the final response:

The fix correctly prioritizes checking for explored datasets and is properly implemented.

The code shows that:

  1. FILENAME_DATASOURCE_PROPERTIES_JSON is a well-defined constant used consistently across the codebase
  2. The containsMatchingFile helper method is used to check for file existence with proper depth control (a hypothetical sketch of such a helper follows after this list)
  3. The order of checks ensures that explored datasets (with datasource-properties.json) are identified before attempting zarr detection
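
As a rough illustration, such a depth-limited lookup could be implemented along these lines (a hypothetical sketch using java.nio.file; the name, signature, and plain Boolean return type are assumptions, not the actual datastore code):

```scala
import java.nio.file.{Files, Path}
import scala.jdk.CollectionConverters._

// Hypothetical helper: walk the directory up to maxDepth levels and check
// whether any file matches one of the given names.
def containsMatchingFile(fileNames: List[String], dir: Path, maxDepth: Int): Boolean = {
  val stream = Files.walk(dir, maxDepth)
  try stream.iterator().asScala.exists(p => fileNames.contains(p.getFileName.toString))
  finally stream.close()
}
```

With a helper like this, looksLikeExploredDataSource amounts to checking for FILENAME_DATASOURCE_PROPERTIES_JSON at a small depth, while looksLikeZarrArray checks for .zarray files with maxDepth = 2.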
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

```bash
#!/bin/bash
# Description: Verify the handling of zarr datasets with existing datasource-properties.json

# Test 1: Check if there are any zarr datasets with datasource-properties.json
rg -l --type-add 'zarr:*.zarray' --type zarr . | while read -r file; do
    dir=$(dirname "$file")
    if [ -f "$dir/datasource-properties.json" ]; then
        echo "Found zarr dataset with properties: $dir"
        cat "$dir/datasource-properties.json"
    fi
done

# Test 2: Verify the exploration logic order in other files
rg -A 5 "looksLikeExploredDataSource|looksLikeZarrArray" --type scala
```

Length of output: 4766


Script:

```bash
#!/bin/bash
# Check the implementation of containsMatchingFile and the constants used
ast-grep --pattern 'private def containsMatchingFile'
ast-grep --pattern 'val FILENAME_DATASOURCE_PROPERTIES_JSON ='
ast-grep --pattern 'val FILENAME_DOT_ZARRAY ='

# Check for any test cases related to this functionality
rg -l "guessTypeOfUploadedDataSource|looksLikeExploredDataSource|looksLikeZarrArray" --type scala "*Test.scala" "*Spec.scala"
```

Length of output: 66266


Script:

```bash
#!/bin/bash
# Search for the constant definitions and method implementations
rg -A 2 "FILENAME_DATASOURCE_PROPERTIES_JSON|FILENAME_DOT_ZARRAY|containsMatchingFile" --type scala

# Look for test files specifically in the test directory
fd -e scala -e test "UploadService" "test/"
```

Length of output: 13549

@fm3 (Member) left a comment


Good find!

Heap overflow doesn’t sound good. Could you create an issue with steps to reproduce (maybe link to such a json via a Slack link)?

@MichaelBuessemeyer merged commit 5dc711d into master on Dec 10, 2024 (3 checks passed)
@MichaelBuessemeyer deleted the fix-explore-zarr-bug branch on December 10, 2024 at 08:45