
Fix upload of zarr datasets with already existing datasource-properties.json file #8268

Merged
MichaelBuessemeyer merged 3 commits into master from fix-explore-zarr-bug on Dec 10, 2024

Conversation

@MichaelBuessemeyer (Contributor) commented Dec 9, 2024

This PR fixes the upload of zarr datasets that already have a datasource-properties.json file present. The bug originates from (I'd say) wrong prioritization when finishing a dataset upload: first, the backend checked whether the uploaded dataset is in zarr format, and only if that failed did it check whether a datasource-properties.json is present (meaning the dataset is already ready to be used). This PR moves the check for a datasource-properties.json to the first position. Thus, a dataset with an already existing datasource-properties.json is simply treated as already explored, and no additional effort is spent trying to generate a datasource-properties.json file for it.
Previously, the backend would detect such a dataset as zarr and assume it needed a datasource-properties.json whose contents had to be guessed. Since this is currently not supported by the zarr after-upload exploration code, the uploads failed.
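
Concretely, the reordered checks look roughly like this (a sketch reconstructed from the reviewed diff below; the final fallback branch is a placeholder, since the remaining format checks are not shown in this PR):

```scala
private def guessTypeOfUploadedDataSource(dataSourceDir: Path): UploadedDataSourceType.Value =
  if (looksLikeExploredDataSource(dataSourceDir).openOr(false)) {
    // A datasource-properties.json is already present: treat the dataset as
    // explored and skip any attempt to guess its contents.
    UploadedDataSourceType.EXPLORED
  } else if (looksLikeZarrArray(dataSourceDir, maxDepth = 2).openOr(false)) {
    // Only datasets without a properties file go through zarr exploration.
    UploadedDataSourceType.ZARR
  } else {
    // ... further format checks follow here (placeholder; not part of this diff)
    ???
  }
```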

I hope this is the correct way of fixing this. An alternative (which I tried first, before locating the wrong ordering of determining the dataset format) was to fix the zarr upload exploration. But this created strange datasets: all mags were interpreted as layers, which led to a wrong datasource-properties.json. Moreover: !! @fm3 the server crashed with a heap overflow when trying to view the dataset. IMO this should never happen just because some datasource-properties.json is misconfigured.

URL of deployed dev instance (used for testing):

  • https://___.webknossos.xyz

Steps to test:

  • prepare a single-layered zarr dataset with a datasource-properties.json
  • zip it (see the sketch below)
  • upload it via the UI
  • the upload should succeed, and so should viewing the data
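
For the zipping step, something like the following should work (the directory name is illustrative; the datasource-properties.json must end up inside the archive alongside the zarr data):

```bash
# Zip the dataset directory so that datasource-properties.json and the
# zarr arrays are contained in the archive (name is just an example).
zip -r my_zarr_dataset.zip my_zarr_dataset/
```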

Issues:



coderabbitai bot commented Dec 9, 2024

📝 Walkthrough

The pull request introduces several updates to the WEBKNOSSOS project, including the addition of a tooltip displaying the total volume of datasets, renaming capabilities for datasets, and improved default colors for skeleton trees. The terminology "resolution" has been replaced with "magnification." Enhancements to AI model training accommodate differently sized bounding boxes. Multiple bug fixes address issues with dataset listing, uploads, annotation layer name restrictions, and synchronization of dataset settings. Additionally, support for HTTP API versions 3 and 4 has been removed, affecting backward compatibility.

Changes

| File Path | Change Summary |
| --- | --- |
| CHANGELOG.unreleased.md | Updated to document new features, bug fixes, and terminology changes, including the tooltip for dataset volume, renaming datasets, and removal of HTTP API versions 3 and 4. |
| webknossos-datastore/app/com/scalableminds/webknossos/datastore/services/uploading/UploadService.scala | Modified the guessTypeOfUploadedDataSource method to prioritize checks for explored data sources over zarr arrays; no changes to public entity declarations. |

Possibly related PRs

Poem

🐰 In the garden of code, changes bloom bright,
With tooltips and colors, a beautiful sight.
Datasets can dance with new names they wear,
While bugs hop away, no longer a scare.
So let’s celebrate with a joyful cheer,
For WEBKNOSSOS shines, bringing us near! 🌼


📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 14737b3 and e29abcc.

📒 Files selected for processing (2)
  • CHANGELOG.unreleased.md (1 hunks)
  • webknossos-datastore/app/com/scalableminds/webknossos/datastore/services/uploading/UploadService.scala (1 hunks)
🔇 Additional comments (2)
CHANGELOG.unreleased.md (1)

24-24: LGTM! Clear and accurate changelog entry.

The changelog entry accurately describes the fix for uploading zarr datasets with existing datasource-properties.json files.
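
For reference, the entry is presumably along these lines (reconstructed from this description, not quoted verbatim from CHANGELOG.unreleased.md):

```
- Fixed the upload of zarr datasets that already contain a datasource-properties.json file. [#8268](https://github.com/scalableminds/webknossos/pull/8268)
```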

webknossos-datastore/app/com/scalableminds/webknossos/datastore/services/uploading/UploadService.scala (1)

469-472: LGTM! Correct prioritization of data source type checks.

The reordering of checks to prioritize datasource-properties.json verification before zarr array detection is the right fix. This prevents the backend from attempting to guess the contents when the file already exists.

Let's verify that this is the only location where we determine the data source type:

✅ Verification successful

LGTM! The data source type determination is correctly isolated.

The search results confirm that:

  • The data source type determination logic is only implemented in UploadService.scala
  • The looksLikeExploredDataSource and looksLikeZarrArray checks are only used for type determination in this single location
  • Other usages of FILENAME_DATASOURCE_PROPERTIES_JSON are for reading/writing properties, not for type determination
  • No other implementations of guessTypeOfUploadedDataSource were found
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

```bash
#!/bin/bash
# Description: Check for other instances of data source type determination
# Expect: Only this implementation should handle the logic

# Search for similar data source type determination logic
rg -A 5 "looksLikeExploredDataSource|looksLikeZarrArray|FILENAME_DATASOURCE_PROPERTIES_JSON"

# Search for other implementations of guessTypeOfUploadedDataSource
ast-grep --pattern 'def guessTypeOfUploadedDataSource'
```

Length of output: 12855


@coderabbitai (bot) left a comment

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (1)
webknossos-datastore/app/com/scalableminds/webknossos/datastore/services/uploading/UploadService.scala (1)

469-471: Consider adding debug logging for dataset type detection.

To improve debuggability, consider adding debug-level logging that shows which conditions were checked and why a dataset was classified as a particular type. This would help diagnose similar issues in the future.

Example implementation:

```diff
 private def guessTypeOfUploadedDataSource(dataSourceDir: Path): UploadedDataSourceType.Value =
   if (looksLikeExploredDataSource(dataSourceDir).openOr(false)) {
+    logger.debug(s"Dataset at $dataSourceDir identified as already explored")
     UploadedDataSourceType.EXPLORED
   } else if (looksLikeZarrArray(dataSourceDir, maxDepth = 2).openOr(false)) {
+    logger.debug(s"Dataset at $dataSourceDir identified as zarr array")
     UploadedDataSourceType.ZARR
```
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 14737b3 and f9ac62b.

📒 Files selected for processing (1)
  • webknossos-datastore/app/com/scalableminds/webknossos/datastore/services/uploading/UploadService.scala (1 hunks)
🔇 Additional comments (1)
webknossos-datastore/app/com/scalableminds/webknossos/datastore/services/uploading/UploadService.scala (1)

469-471: LGTM! The fix correctly prioritizes checking for explored datasets.

The reordering of checks ensures that datasets with existing datasource-properties.json are properly identified as already explored, preventing unnecessary re-exploration attempts. This fixes the issue with zarr dataset uploads.

Let's verify the fix handles the described scenario correctly:

✅ Verification successful

Let me generate additional verification scripts to check the implementation details of these methods and their usage:

This will help us verify:

  1. The actual implementation of the file checking logic
  2. The constant definitions to ensure correct file names are being checked
  3. The existence of test cases validating this behavior

Let me try one more verification to check the implementation details:

This will help us verify:

  1. The constant definitions used for file checks
  2. The implementation of the file matching logic
  3. Any related test cases in the test directory

Based on the verification results, I can now generate the final response:

The fix correctly prioritizes checking for explored datasets and is properly implemented.

The code shows that:

  1. FILENAME_DATASOURCE_PROPERTIES_JSON is a well-defined constant used consistently across the codebase
  2. The containsMatchingFile helper method is used to check for file existence with proper depth control (a hypothetical sketch of such a helper follows after this list)
  3. The order of checks ensures that explored datasets (with datasource-properties.json) are identified before attempting zarr detection
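
As a rough illustration, such a depth-limited lookup could be implemented along these lines (a hypothetical sketch using java.nio.file; the name, signature, and plain Boolean return type are assumptions, not the actual datastore code):

```scala
import java.nio.file.{Files, Path}
import scala.jdk.CollectionConverters._

// Hypothetical helper: walk the directory up to maxDepth levels and check
// whether any file matches one of the given names.
def containsMatchingFile(fileNames: List[String], dir: Path, maxDepth: Int): Boolean = {
  val stream = Files.walk(dir, maxDepth)
  try stream.iterator().asScala.exists(p => fileNames.contains(p.getFileName.toString))
  finally stream.close()
}
```

With a helper like this, looksLikeExploredDataSource amounts to checking for FILENAME_DATASOURCE_PROPERTIES_JSON at a small depth, while looksLikeZarrArray checks for .zarray files with maxDepth = 2.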
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

```bash
#!/bin/bash
# Description: Verify the handling of zarr datasets with existing datasource-properties.json

# Test 1: Check if there are any zarr datasets with datasource-properties.json
rg -l --type-add 'zarr:*.zarray' --type zarr . | while read -r file; do
    dir=$(dirname "$file")
    if [ -f "$dir/datasource-properties.json" ]; then
        echo "Found zarr dataset with properties: $dir"
        cat "$dir/datasource-properties.json"
    fi
done

# Test 2: Verify the exploration logic order in other files
rg -A 5 "looksLikeExploredDataSource|looksLikeZarrArray" --type scala
```

Length of output: 4766


Script:

```bash
#!/bin/bash
# Check the implementation of containsMatchingFile and the constants used
ast-grep --pattern 'private def containsMatchingFile'
ast-grep --pattern 'val FILENAME_DATASOURCE_PROPERTIES_JSON ='
ast-grep --pattern 'val FILENAME_DOT_ZARRAY ='

# Check for any test cases related to this functionality
rg -l "guessTypeOfUploadedDataSource|looksLikeExploredDataSource|looksLikeZarrArray" --type scala "*Test.scala" "*Spec.scala"
```

Length of output: 66266


Script:

```bash
#!/bin/bash
# Search for the constant definitions and method implementations
rg -A 2 "FILENAME_DATASOURCE_PROPERTIES_JSON|FILENAME_DOT_ZARRAY|containsMatchingFile" --type scala

# Look for test files specifically in the test directory
fd -e scala -e test "UploadService" "test/"
```

Length of output: 13549

@fm3 (Member) left a comment


Good find!

Heap overflow doesn’t sound good. Could you create an issue with steps to reproduce (maybe link to such a json via a Slack link)?

@MichaelBuessemeyer merged commit 5dc711d into master on Dec 10, 2024 (3 checks passed)
@MichaelBuessemeyer deleted the fix-explore-zarr-bug branch on December 10, 2024 at 08:45