When importing "bundle" formats like pg_dump or mysqldump, which include schema definitions and row data in a single file, we currently download and parse the file during import planning to extract the schema definitions, because we want to resolve or create all the tables we will import into before we create the IMPORT job.
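For illustration, here is a minimal sketch of that ordering (all names are hypothetical stand-ins, not the actual import code): the whole dump is fetched and parsed for DDL before any job exists, so all of that work happens with nothing for the user to monitor.

```go
package main

import "fmt"

// Hypothetical stand-ins for the real planning/job machinery; names are
// illustrative only.
type tableDef struct{ name string }

// downloadAndParseSchemas stands in for the planning-time work: fetch the
// whole dump and extract the CREATE TABLE statements from it.
func downloadAndParseSchemas(dumpURL string) []tableDef {
	// For a 300GB dump this is where the user-visible stall happens:
	// no job exists yet, so there is nothing to report progress on.
	return []tableDef{{name: "t"}}
}

func createImportJob(tables []tableDef) {
	fmt.Printf("job created for %d tables\n", len(tables))
}

func planImport(dumpURL string) {
	tables := downloadAndParseSchemas(dumpURL) // slow, invisible to the user
	createImportJob(tables)                    // only now does a job appear
}

func main() {
	planImport("nodelocal://1/dump.sql")
}
```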
However, when presented with something like a 300GB pg_dump file, this means the IMPORT statement spends a long time in planning before creating a job, and during those minutes (or hours?) the user has no indication of what is going on -- there is no job to inspect or report progress on, even though we're clearly doing bulk-y work.
At the very least, it'd be more user-friendly to move the fetching and parsing of schemas to a prepare step of the actual import job execution, rather than doing it in the planning phase. Ideally, though, we'd avoid that step and the double download-and-parse of the file entirely, and instead simply parse schema definitions as we go, in the same pass that processes the row data. Unfortunately, pg_dump emits the index definitions after the row data, while one of the big advantages of IMPORT is that we can generate all the KVs for a row, including index KVs, in one pass, so it isn't clear what the "right" way to do this is. We could optimistically assume there are no indexes and import in one pass, and then, if/when we see indexes, either queue normal index creation and/or a second pass that just generates index KVs for those indexes?
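One possible shape for that single-pass idea, sketched below under simplifying assumptions (line-oriented statement detection, hypothetical names, no real COPY-data handling): create tables as their DDL is seen, ingest rows as they arrive, and defer any CREATE INDEX statements encountered after the row data to a second phase, whether that is ordinary index creation or a dedicated index-KV pass.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// Hypothetical single-pass importer: statements are handled in the order the
// dump emits them, and index DDL seen after row data is queued rather than
// blocking the data pass.
func importDump(r *bufio.Scanner) {
	var deferredIndexes []string

	for r.Scan() {
		stmt := strings.TrimSpace(r.Text())
		switch {
		case strings.HasPrefix(stmt, "CREATE TABLE"):
			// Create (or resolve) the table as soon as its definition appears.
			fmt.Println("creating table:", stmt)
		case strings.HasPrefix(stmt, "COPY"):
			// Ingest row data optimistically, generating only primary-index KVs,
			// since we don't yet know whether secondary indexes exist.
			fmt.Println("ingesting rows:", stmt)
		case strings.HasPrefix(stmt, "CREATE INDEX"):
			// pg_dump emits these after the data, so queue them for a second
			// phase: normal index creation or a pass that only generates index KVs.
			deferredIndexes = append(deferredIndexes, stmt)
		}
	}

	for _, idx := range deferredIndexes {
		fmt.Println("deferred index build:", idx)
	}
}

func main() {
	dump := `CREATE TABLE t (a INT PRIMARY KEY, b INT);
COPY t (a, b) FROM stdin;
CREATE INDEX b_idx ON t (b);`
	importDump(bufio.NewScanner(strings.NewReader(dump)))
}
```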