
Migration between deployments/Export project functionality #1780

Open
matusdrobuliak66 opened this issue May 14, 2024 · 2 comments

Labels
PO issue Created by Product owners


matusdrobuliak66 commented May 14, 2024

Based on the working group ITISFoundation/osparc-ops-environments#672, we decided to investigate these three options:


(1) Importing from target deployment

Using an ad-hoc GUI, the user can import their projects from another deployment.

Prerequisites:

  • user must have an account in both source and destination deployments
  • user must authenticate with their source credentials inside the destination deployment (this generates tokens for the purpose of importing projects)

Changes to oSPARC:

  • create endpoint for authenticating the user in another deployment
  • create endpoint for listing projects available to the user (maybe we can reuse something?)
  • create endpoint to start a copy (lock project): provides "project data" + "tokens to copy data from s3"
  • submit a job that "imports" the project: first sync the data, then insert the project in the DB; if it fails, remove the data
  • create endpoint to signal that the copy operation is done (unlocks project); a rough sketch of these endpoints follows below
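
To make the list above more concrete, here is a rough sketch of what these endpoints could look like, assuming a FastAPI-style router (as used in osparc-simcore); the route paths, models and helpers are hypothetical placeholders, not the actual API.

```python
# Hypothetical sketch of the option (1) endpoints; route paths, models and helpers
# are illustrative placeholders, not the actual osparc-simcore API.
from uuid import UUID

from fastapi import APIRouter, BackgroundTasks, Header
from pydantic import BaseModel

router = APIRouter(prefix="/v0/deployment-import")


class SourceCredentials(BaseModel):
    source_deployment_url: str
    email: str
    password: str


class ImportTokens(BaseModel):
    api_token: str  # short-lived token to call the source deployment's API
    s3_token: str   # scoped token to copy the project data from the source S3


async def _run_import_job(project_uuid: UUID, api_token: str) -> None:
    # 1. lock the project in the source deployment, 2. sync the S3 data,
    # 3. insert the project row in the DB, 4. on failure remove the copied data,
    # 5. signal the source that the copy is done (unlock)
    raise NotImplementedError


@router.post("/auth", response_model=ImportTokens)
async def authenticate_in_source(credentials: SourceCredentials) -> ImportTokens:
    """Authenticate the user against the *source* deployment and return import tokens."""
    raise NotImplementedError  # would call the source deployment's login endpoint


@router.get("/projects")
async def list_remote_projects(x_import_token: str = Header(...)) -> list[dict]:
    """List the projects the user could import from the source deployment."""
    raise NotImplementedError  # would proxy the source deployment's project listing


@router.post("/projects/{project_uuid}:import", status_code=202)
async def start_import(
    project_uuid: UUID, background: BackgroundTasks, x_import_token: str = Header(...)
) -> dict:
    """Start the copy: lock the source project and submit the background import job."""
    background.add_task(_run_import_job, project_uuid, x_import_token)
    return {"job_id": str(project_uuid)}
```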

PROS:

  • not very complex: we rely on already existing tools and just add a few new API endpoints
  • potentially can be used internally to make a copy of an existing project (target the same deployment)
  • avoids creating a "data model" for exporting and importing user data by using rclone to copy S3 to S3

CONS:

  • user does not get access to their data (they can only move it from deployment A to deployment B)

(2) Archiving

Generate an archive containing project data and data stored in all nodes.

Prerequisites:

  • user must have an account in both source and destination deployments
  • user must have enough disk space to download the archive to their computer

Changes to oSPARC:

  • create endpoint for starting the export procedure
  • background job that creates the archive:
    • download the files and put them in an archive, eventually compressing them, + package the data model for the project
    • upload the archive to S3 (with an expiration)
    • notify the user (via email?) that the archive is available for download
  • solid upload process that is able to resume (requires backend/FE coordination); see the chunked-upload sketch below
    • split the file into chunks
    • retry if a chunk fails to upload
    • put the chunks together into a single file
  • import process (once the file is available, start the import)
    • check archive validity (nobody tampered with it)
    • extract the data from the archive and upload it to S3 (rollback on error)
    • insert the project in the DB
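
As a rough illustration of the "solid upload process" point, a resumable chunked upload could look like the sketch below; the presigned-URL-per-chunk scheme and the chunk size are assumptions, not existing osparc code.

```python
# Sketch only: resumable chunked upload of the exported archive.
# The presigned-URL helpers and the chunk size are illustrative assumptions.
from pathlib import Path

import requests

CHUNK_SIZE = 50 * 1024 * 1024  # 50 MiB per chunk (assumption)
MAX_RETRIES = 3


def upload_archive(archive: Path, presigned_urls: list[str], completed: set[int]) -> set[int]:
    """Upload `archive` chunk by chunk; `completed` holds indices already confirmed
    by the backend, so a restarted client resumes instead of re-uploading everything."""
    with archive.open("rb") as fh:
        for index, url in enumerate(presigned_urls):
            if index in completed:
                continue  # resume: skip chunks the backend already has
            fh.seek(index * CHUNK_SIZE)
            chunk = fh.read(CHUNK_SIZE)
            for attempt in range(1, MAX_RETRIES + 1):
                try:
                    response = requests.put(url, data=chunk, timeout=60)
                    response.raise_for_status()
                    completed.add(index)
                    break  # chunk done, move on to the next one
                except requests.RequestException:
                    if attempt == MAX_RETRIES:
                        raise  # give up; calling upload_archive again resumes
    # the backend would now be asked to stitch the chunks into a single file
    return completed
```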

PROS:

  • user has a physical copy of the data; by opening the archive they could extract a single file

CONS:

  • requires a third-party computer (the user's) to download and re-upload the archive
  • adds two extra steps compared to solution (1): archive creation and archive extraction
  • requires more moving parts:
    • links that expire
    • archive management: import + export
    • there is one extra job queue (for exporting)

(3) Migration

The idea here is to migrate one deployment to another.

  • migrate S3 data
  • Database migration (issues with autogenerated integer primary/foreign keys) - potential solutions (a sketch follows after the table list below):
    • Change the primary keys to randomly generated string IDs.
    • Retain integer keys but artificially increase the integers by a large number (an offset).
    • Change int to string and add a prefix (a different prefix per deployment).
    • Almost all tables:
      • clusters, cluster_to_groups
      • comp_runs
      • comp_tasks
      • folders, workspaces
      • groups + all resources access rights
      • payments
      • resource tracker
      • pricing plans / units / costs
      • users
      • ...
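
For illustration, the two key-rewriting options could look roughly like this, assuming SQLAlchemy against PostgreSQL; table and column names are placeholders, and every foreign key needs the offset of the table it references.

```python
# Illustrative only: two ways to avoid primary-key collisions when copying rows
# from the source DB into the destination DB. Table/column names are placeholders.
import secrets

import sqlalchemy as sa


def offset_for(dest_engine: sa.engine.Engine, table: str, pk: str = "id") -> int:
    """Offset option: shift source integer IDs by the destination's current maximum."""
    with dest_engine.connect() as conn:
        max_id = conn.execute(sa.text(f"SELECT COALESCE(MAX({pk}), 0) FROM {table}")).scalar_one()
    return int(max_id)


def remap_row(row: dict, pk_offset: int, fk_offsets: dict[str, int]) -> dict:
    """Apply the table's offset to the PK, and each referenced table's offset to its FK column."""
    remapped = dict(row)
    remapped["id"] = row["id"] + pk_offset
    for fk_column, fk_offset in fk_offsets.items():
        if remapped.get(fk_column) is not None:
            remapped[fk_column] = remapped[fk_column] + fk_offset
    return remapped


def prefixed_id(prefix: str) -> str:
    """Prefix option: Stripe-like string IDs, e.g. 'prj_osparc_9f3c2a...'; the prefix
    encodes the source deployment so IDs can never collide across deployments."""
    return f"{prefix}_{secrets.token_hex(12)}"
```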

PROS:

  • We will not face issues with migration between deployments in the future.

CONS:

  • It is a one-time effort for a full migration between deployments (not a user-facing feature, as in the previous cases)

@matusdrobuliak66 matusdrobuliak66 added this to the Leeroy Jenkins milestone May 14, 2024
@sanderegg sanderegg removed this from the South Island Iced Tea milestone Jul 8, 2024
@sanderegg sanderegg added this to the Eisbock milestone Aug 13, 2024
@sanderegg sanderegg removed this from the Eisbock milestone Sep 13, 2024

pcrespov commented Sep 30, 2024

Brainstorming on Sep.27, 2024

There was no consensus on a clear preference for any of the proposed solutions above. Below are some notes from the discussion.

Data Migration from Source to Destination Database

When migrating data between databases, especially PostgreSQL tables with identifiers and relationships, it’s important to go beyond just viewing it as a transfer of data rows. The semantics of the data (i.e., the meaning of the entities and their relationships) must also be considered. Still, some of the key challenges can already be identified, particularly around merging data that exists in both the source and destination databases:

Key Challenges:

  1. Integer Identifiers:

    • Apply an offset to the source table IDs by adding the maximum ID value from the destination table to avoid conflicts.
    • While it’s not mandatory, switching to more unique, descriptive identifiers (similar to Stripe-like IDs such as name_1456123456asdfa45) would be preferable.
  2. Merging Existing Resources (e.g., Users, Products):

    • Users: Handle records where users have the same email address in both source and destination databases (see the sketch after this list).
    • Products: Manage cases where products share the same product name across both databases.
    • Group 1: Identify and handle additional resource overlaps.
  3. Maintaining Dependencies (e.g., Groups):

    • To preserve data integrity, ensure that related records (e.g., groups) are inserted in the correct order during migration. This guarantees that dependencies are maintained.
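
As an illustration of challenge 2, a user merge keyed on email could look roughly like the sketch below; the users table layout is an assumption, not the real schema.

```python
# Sketch: build a mapping source_user_id -> destination_user_id so that rows
# referencing users can be rewritten during migration. Schema names are assumed.
import sqlalchemy as sa


def build_user_id_map(src_engine: sa.engine.Engine, dst_engine: sa.engine.Engine) -> dict[int, int]:
    with src_engine.connect() as src, dst_engine.connect() as dst:
        src_users = src.execute(sa.text("SELECT id, email FROM users")).all()
        dst_by_email = {
            email: user_id
            for user_id, email in dst.execute(sa.text("SELECT id, email FROM users")).all()
        }

    id_map: dict[int, int] = {}
    for src_id, email in src_users:
        if email in dst_by_email:
            # same email on both sides: merge onto the existing destination user
            id_map[src_id] = dst_by_email[email]
        else:
            # no counterpart: the user (and its primary group) must be inserted first,
            # in dependency order, before any row that references it
            id_map[src_id] = -1  # placeholder until the new destination id is known
    return id_map
```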

A Semantic Approach to Migration

Considering the database's structure and meaning, a more strategic approach is to break the migration into stages based on different contexts. This allows for grouping related tables and migrating them together, either manually or automatically; one possible staged ordering is sketched after the list of contexts below.

Identified Contexts:

  1. Platform Configurations:

    • Clusters
    • Products
    • Product Prices
    • (...)
  2. Users:

    • Users
    • Wallets
    • User Preferences (Frontend)
    • (Additional user-related tables)
  3. Services:

    • Service Metadata
    • Service Access Rights
    • (Additional service-related tables)
  4. Studies (Projects + Data):

    • Projects
    • Folders
    • File Metadata
    • (...)
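
One possible way to encode this staged, context-based ordering is a plain ordered configuration that the migration tooling walks through; the table names below only approximate the groups listed above and are not exhaustive.

```python
# Illustrative ordering of migration stages; each stage is migrated (and validated)
# before the next one starts, so dependencies are always present in the destination.
MIGRATION_STAGES: list[tuple[str, list[str]]] = [
    ("platform_configuration", ["clusters", "products", "products_prices"]),
    ("users", ["users", "groups", "user_to_groups", "wallets", "user_preferences_frontend"]),
    ("services", ["services_meta_data", "services_access_rights"]),
    ("studies", ["projects", "folders", "file_meta_data"]),
]


def run_migration(migrate_table, validate_stage) -> None:
    """`migrate_table` and `validate_stage` are callables supplied by the migration tooling."""
    for stage_name, tables in MIGRATION_STAGES:
        for table in tables:
            migrate_table(table)
        validate_stage(stage_name)  # data-integrity check before moving on
```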

Migration Process Requirements

  1. Data Integrity Checks:

    • Every step of the migration process must include validation checks to ensure data integrity, preventing corruption or data loss.
  2. Checkpoints for Rollback:

    • Implement checkpoints at various stages of the migration to allow for reversion in case a data integrity check fails, ensuring a safe fallback (see the sketch below).
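
A minimal sketch of requirement 2, assuming the migration writes through SQLAlchemy: each stage runs inside a nested transaction (SAVEPOINT) and is rolled back if its integrity check fails. The check itself is a placeholder.

```python
# Sketch: per-stage checkpoint with rollback on a failed integrity check.
# `check_fn` returns True if the destination data is consistent after the stage.
from contextlib import contextmanager

import sqlalchemy as sa


class IntegrityCheckFailed(RuntimeError):
    pass


@contextmanager
def checkpoint(connection: sa.engine.Connection, check_fn):
    """Run one migration stage inside a nested transaction (SAVEPOINT) and revert it
    if the integrity check fails, so earlier stages remain a safe fallback."""
    nested = connection.begin_nested()
    try:
        yield
        if not check_fn(connection):
            raise IntegrityCheckFailed("stage integrity check failed")
        nested.commit()
    except Exception:
        nested.rollback()
        raise
```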

Features

Even though this process will mostly be carried out once and in the backend, it could provide significant value if the ability to import/export studies were also made available as a standalone feature for users.

@odeimaiz odeimaiz transferred this issue from ITISFoundation/osparc-simcore Nov 25, 2024
@odeimaiz odeimaiz added the PO issue Created by Product owners label Nov 25, 2024

matusdrobuliak66 commented Jan 8, 2025

Discussion on Jan.08, 2025

Dustin and Sylvain R. had a discussion, during which it was concluded that an export/import project functionality might be needed to provide users with the option to migrate from TIP in-house to TIP in the cloud. This functionality is a prerequisite for shutting down the TIP in-house deployment.

My Takeaways from Discussion and Proposed Action Plan

  • We create export and import endpoints
    • For user resource
    • For project resource
  • Export will produce a JSON-type metadata file "artifact" containing all the important information. For example, for a project:
    • Project info -> Services info (including potentially AWS presigned download links for S3 data download); a sketch of such an artifact follows below
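
For concreteness, such an artifact could be shaped roughly as below (written as a Python structure); all field names, the service keys and the presigned links are assumptions, not a finalized format.

```python
# Hypothetical shape of the exported project "artifact"; field names are illustrative.
EXAMPLE_PROJECT_ARTIFACT = {
    "artifact_version": "1.0",
    "exported_from": "source-deployment.example.com",  # assumption
    "project": {
        "uuid": "00000000-0000-0000-0000-000000000001",  # re-generated on import
        "name": "My study",
        "description": "example study exported for migration",
        "workbench": {  # one entry per node/service in the pipeline
            "node-uuid-1": {
                "key": "simcore/services/dynamic/jupyter-math",  # example service key
                "version": "2.0.8",
                "inputs": {},
                "outputs": {},
                "data": [
                    {
                        "path": "node-uuid-1/output.dat",
                        "size": 12345,
                        "presigned_download_link": "https://s3.example.com/...",  # optional, expiring
                    }
                ],
            }
        },
    },
}
```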

Project import

  • The import endpoint will receive the metadata information
  • An asynchronous task will start, which might return some job ID (we might create a special table for this use case)
  • The import implementation will reuse as much existing functionality as possible, e.g.:
    • creation/update of the project
    • creation/update of the project node
  • It will create new resource IDs on the fly during the import (see the sketch below)
  • It will also trigger some AWS service (outside of the simcore docker stack) that will take care of downloading or moving the S3 data to the right location
  • When the import is finished, the user will be notified, for example by email.
  • To keep the process as simple and minimalistic as possible, we will store only the essential data required to upload the project. This will be clearly and transparently explained to the user. For example: all tags will be lost, all sharing settings will be removed, and even workspace and folder paths may be discarded. Essentially, the project will always appear in the root folder within the private workspace. History of service runs will be lost.
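
A small sketch of the "new resource IDs on the fly" point: the project and every node get fresh UUIDs on import, and links inside the workbench are rewritten accordingly. The workbench layout mirrors the artifact sketch above and is an assumption.

```python
# Sketch: regenerate project/node UUIDs during import and rewrite references.
import copy
import uuid


def remap_project_ids(exported_project: dict) -> dict:
    project = copy.deepcopy(exported_project)
    project["uuid"] = str(uuid.uuid4())  # fresh project UUID in the destination

    old_to_new = {old: str(uuid.uuid4()) for old in project.get("workbench", {})}

    new_workbench = {}
    for old_id, node in project.get("workbench", {}).items():
        # node inputs may point at other nodes by their old UUIDs; rewrite those links
        for _port, value in list(node.get("inputs", {}).items()):
            if isinstance(value, dict) and value.get("nodeUuid") in old_to_new:
                value["nodeUuid"] = old_to_new[value["nodeUuid"]]
        new_workbench[old_to_new[old_id]] = node
    project["workbench"] = new_workbench
    return project
```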

User import/export

  • Similar logic can be used to build a user import endpoint, which will be utilized exclusively by admins. For example, it can be executed in a loop to migrate users from one deployment to another (a sketch follows below).
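
Such an admin-driven loop could look roughly like this, assuming hypothetical /v0/admin/users:export and /v0/admin/users:import endpoints, example deployment URLs, and API keys for both deployments:

```python
# Sketch only: admin script looping over exported users and importing them into
# the destination deployment. Endpoints, URLs and headers are hypothetical.
import httpx

SOURCE_URL = "https://tip-inhouse.example.com"   # assumption
DESTINATION_URL = "https://tip-cloud.example.com"  # assumption


def migrate_users(source_api_key: str, destination_api_key: str) -> None:
    with httpx.Client(timeout=30) as client:
        resp = client.get(
            f"{SOURCE_URL}/v0/admin/users:export",
            headers={"X-Api-Key": source_api_key},
        )
        resp.raise_for_status()

        for user_artifact in resp.json():
            imported = client.post(
                f"{DESTINATION_URL}/v0/admin/users:import",
                headers={"X-Api-Key": destination_api_key},
                json=user_artifact,
            )
            imported.raise_for_status()
```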

Services

  • We will conduct a manual analysis of the differences between the services in the two deployments. Based on this analysis, we will decide which services to support. These services will then be migrated manually or with the help of custom scripts.

Product / Platform

  • These will be manually set up in the new deployment.

Action

One of the first steps we can take is to start writing unit tests that utilize the existing creation and update functionality in the code. These tests will create a project with multiple services using newly generated IDs. The input to the tests will resemble the metadata JSON, which aligns with the export functionality we aim to implement (a sketch follows below).
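
A sketch of such a test, with a stand-in for the creation/update functionality it would actually exercise; the metadata fixture mirrors the export artifact discussed above, and the service keys are only examples.

```python
# Sketch of the proposed test: build a project with several services from
# export-like metadata JSON and check that all IDs are newly generated.
# `create_project_from_metadata` is a stand-in for the existing creation/update code.
import uuid

import pytest


def create_project_from_metadata(metadata: dict) -> dict:
    """Stand-in: the real test would call the existing project creation/update functions."""
    return {
        "uuid": str(uuid.uuid4()),
        "workbench": {str(uuid.uuid4()): node for node in metadata["workbench"].values()},
    }


@pytest.fixture
def project_metadata() -> dict:
    # shaped like the export "artifact" metadata discussed above (field names assumed)
    return {
        "name": "exported study",
        "workbench": {
            str(uuid.uuid4()): {"key": "simcore/services/comp/itis/sleeper", "version": "2.0.2"},
            str(uuid.uuid4()): {"key": "simcore/services/dynamic/jupyter-math", "version": "2.0.8"},
        },
    }


def test_import_creates_project_with_new_ids(project_metadata: dict):
    imported = create_project_from_metadata(project_metadata)

    assert imported["uuid"]  # a fresh project UUID was generated
    # none of the original node IDs survive; their count does
    assert set(imported["workbench"]).isdisjoint(project_metadata["workbench"])
    assert len(imported["workbench"]) == len(project_metadata["workbench"])
```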
