importccl: Allow users to IMPORT into an existing table #26834
Labels
A-disaster-recovery
C-enhancement
Currently IMPORT can only create a new table into which it imports data.
However, many users would like to IMPORT into an existing table. In some cases this is because they want to configure a table's replication settings before importing the data. In others they want to do the import in stages, for example because they simply cannot fit all the data to import in a single source location.
Other users want to IMPORT new data at regular intervals (e.g. nightly dumps from some other system), though such cases raise additional questions about whether and how the import can modify existing data.
The two main differences between importing into a new table and importing into an existing one are that an existing table can have a) traffic interacting with it and b) existing data.
One of the biggest reasons IMPORT is so much faster than bulk INSERTs is that it skips the majority of the transactional writing infrastructure. This makes point a) tricky: we cannot ensure the transactional safety of normal SQL traffic, for which transactional safety is implied, while bulk ingestion is happening to the same data. The easiest way to reduce importing into an existing table to the current import with respect to a) is to simply remove the traffic: schema-change the table to an offline state, import, then bring it back online.
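The offline, import, online sequence described above can be sketched as a toy state machine. This is only an illustration under stated assumptions: `Table`, `TableState`, and `import_into` are hypothetical names, the "table" is an in-memory dict, and the snapshot-based revert is a stand-in for whatever rollback a real implementation would use.

```python
from enum import Enum


class TableState(Enum):
    PUBLIC = "public"    # normal SQL traffic allowed
    OFFLINE = "offline"  # no SQL traffic, so bulk ingestion is safe


class Table:
    """Hypothetical in-memory stand-in for a table descriptor plus its data."""

    def __init__(self, name):
        self.name = name
        self.state = TableState.PUBLIC
        self.rows = {}  # primary key -> row

    def insert(self, key, row):
        # Stand-in for normal transactional SQL traffic.
        if self.state is not TableState.PUBLIC:
            raise RuntimeError(f"table {self.name!r} is offline")
        self.rows[key] = row


def import_into(table, incoming):
    """Take the table offline, bulk-ingest, then bring it back online."""
    snapshot = dict(table.rows)       # kept only so a failure can be reverted
    table.state = TableState.OFFLINE  # step 1: remove the traffic
    try:
        for key, row in incoming:     # step 2: ingest, bypassing insert()
            table.rows[key] = row
    except Exception:
        table.rows = snapshot         # revert partially ingested data
        raise
    finally:
        table.state = TableState.PUBLIC  # step 3: back online either way
```

While the table is offline, `insert` raises instead of racing with the ingestion, which is the property the offline state is meant to provide.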
Another reason bulk ingestion is faster is that it prepares entire SSTs that go directly into RocksDB, replacing any existing values stored under the keys they contain. In an empty, just-created table, that replacement is not a concern, but in a table that already contains data, it is. Collision detection is possible but not cheap: it would require scanning both the existing and the incoming data, either before or after ingestion, as well as some sort of resolution. For the primary key, one option is to just let the IMPORT replace the existing value, but detecting and handling conflicts in unique secondary indexes is a problem the implementation will need to address.
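To make the cost of collision detection concrete, here is a minimal sketch of one possible approach: a single merge pass over two sorted key streams. It is not the approach taken in the prototype; `find_collisions` is a hypothetical helper, and it assumes both inputs are sorted and free of internal duplicates.

```python
_END = object()  # sentinel marking an exhausted iterator


def find_collisions(existing_keys, incoming_keys):
    """Return keys present in both sorted iterables, using one merge pass."""
    collisions = []
    existing = iter(existing_keys)
    incoming = iter(incoming_keys)
    e = next(existing, _END)
    i = next(incoming, _END)
    while e is not _END and i is not _END:
        if e == i:
            collisions.append(e)      # same key on both sides
            e = next(existing, _END)
            i = next(incoming, _END)
        elif e < i:
            e = next(existing, _END)  # existing side is behind, advance it
        else:
            i = next(incoming, _END)  # incoming side is behind, advance it
    return collisions


# Rows keyed by primary key, with a unique "email" column.
existing_rows = {1: "a@x.com", 2: "b@x.com"}
incoming_rows = {2: "c@x.com", 3: "a@x.com"}

pk_conflicts = find_collisions(sorted(existing_rows), sorted(incoming_rows))
email_conflicts = find_collisions(
    sorted(existing_rows.values()), sorted(incoming_rows.values())
)
print(pk_conflicts)     # [2]
print(email_conflicts)  # ['a@x.com']
```

Both checks read every existing and every incoming key once, which is why the scan is not cheap on a large table. The unique-index check here is deliberately naive: it would also flag an incoming row that replaces an existing row with the same primary key and the same indexed value.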
An initial prototype of this take-the-table-offline-then-import approach was added in #37451.
Still TODO for the initial version to be considered complete:
AS OF SYSTEM TIME before, at, and after each of the transition points.
INSERT.
IMPORT INTO itself.
When taking the table offline, move a constraint to unvalidated if it was validated, and record the names of the constraints moved in the job.
After reverting on failure, a constraint can go straight back to validated if it was validated before.
Otherwise we'll kick off validation.
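The constraint handling in the last TODO item could look roughly like the sketch below. All names are hypothetical (`take_offline`, `revert`, `finish`, the `job` dict, and the `validate` callback), and it assumes that "otherwise" refers to the case where the import succeeded and the newly ingested rows have not been checked yet.

```python
VALIDATED = "validated"
UNVALIDATED = "unvalidated"


def take_offline(constraints, job):
    """Demote validated constraints and record their names in the job."""
    job["demoted"] = [n for n, s in constraints.items() if s == VALIDATED]
    for name in job["demoted"]:
        constraints[name] = UNVALIDATED


def revert(constraints, job):
    """Import failed and its data was removed: restore recorded constraints."""
    for name in job["demoted"]:
        constraints[name] = VALIDATED  # old data was already validated


def finish(constraints, job, validate):
    """Import succeeded: new rows are unchecked, so kick off validation."""
    for name in job["demoted"]:
        if validate(name):             # hypothetical validation callback
            constraints[name] = VALIDATED
```

Recording the demoted names in the job is what lets `revert` tell apart constraints that were validated before the import from ones that never were.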