pg_partman does not create new time-based partitions when tables are created far in the future #347

Closed
josharrington opened this issue Mar 25, 2021 · 6 comments

@josharrington

We recently upgraded from an old version of pg_partman to 4.4.1. We noticed that new partitions were no longer being created for one of our partitioned tables (readings in this example), and data was piling up in the parent table, causing slow performance.

The table is partitioned by month. We noticed we were missing some recent partitions after our upgrade at the end of 2020. We also noticed that some of our partitions were created far into the future; we're not sure why, though we do occasionally have a need to write to a table in the future. We have not yet migrated this table to native partitioning.

   oid    | table_schema |    table_name     |  row_estimate  | total_bytes  | index_bytes | toast_bytes | table_bytes  |  total   |  index  |   toast    |  table
----------+--------------+-------------------+----------------+--------------+-------------+-------------+--------------+----------+---------+------------+---------
  1298216 | gfv2         | readings          |  2.3552326e+08 | 205526704128 | 28212035584 |    51019776 | 177263648768 | 191 GB   | 26 GB   | 49 MB      | 165 GB
  ...
 11997897 | gfv2         | readings_p2020_06 |  6.0450324e+07 |  57285255168 |  8254365696 |     2244608 |  49028644864 | 53 GB    | 7872 MB | 2192 kB    | 46 GB
 12026906 | gfv2         | readings_p2020_07 |  6.6907308e+07 |  61592092672 |  6618972160 |      155648 |  54972964864 | 57 GB    | 6312 MB | 152 kB     | 51 GB
  4851307 | gfv2         | readings_p2020_08 |  6.7446304e+07 |  62043029504 |  6677463040 |       81920 |  55365484544 | 58 GB    | 6368 MB | 80 kB      | 52 GB
 12143295 | gfv2         | readings_p2020_09 |   6.912816e+07 |  65476329472 |  9013354496 |       73728 |  56462901248 | 61 GB    | 8596 MB | 72 kB      | 53 GB
 13144814 | gfv2         | readings_p2020_10 |   7.247224e+07 |  69827420160 |  8948686848 |     1966080 |  60876767232 | 65 GB    | 8534 MB | 1920 kB    | 57 GB
 13053984 | gfv2         | readings_p2020_11 |   7.913043e+07 |  87165255680 |  9432932352 |    18472960 |  77713850368 | 81 GB    | 8996 MB | 18 MB      | 72 GB
  8402263 | gfv2         | readings_p2021_02 | 1.14944136e+08 | 102077898752 | 14568292352 |    23896064 |  87485710336 | 95 GB    | 14 GB   | 23 MB      | 81 GB
 11929131 | gfv2         | readings_p2021_03 | 1.03510456e+08 |  95011717120 | 13588316160 |    23355392 |  81400045568 | 88 GB    | 13 GB   | 22 MB      | 76 GB
  2768548 | gfv2         | readings_p2021_08 |              2 |       114688 |       65536 |        8192 |        40960 | 112 kB   | 64 kB   | 8192 bytes | 40 kB
  6398906 | gfv2         | readings_p2021_11 |              2 |       114688 |       65536 |        8192 |        40960 | 112 kB   | 64 kB   | 8192 bytes | 40 kB
 10534989 | gfv2         | readings_p2023_03 |              1 |       114688 |       65536 |        8192 |        40960 | 112 kB   | 64 kB   | 8192 bytes | 40 kB
 11571177 | gfv2         | readings_p2023_05 |              1 |       114688 |       65536 |        8192 |        40960 | 112 kB   | 64 kB   | 8192 bytes | 40 kB
  4777298 | gfv2         | readings_p2024_01 |              2 |       114688 |       65536 |        8192 |        40960 | 112 kB   | 64 kB   | 8192 bytes | 40 kB
  3209170 | gfv2         | readings_p2024_06 |              1 |       114688 |       65536 |        8192 |        40960 | 112 kB   | 64 kB   | 8192 bytes | 40 kB
  2768567 | gfv2         | readings_p2026_06 |              1 |       114688 |       65536 |        8192 |        40960 | 112 kB   | 64 kB   | 8192 bytes | 40 kB
  5565912 | gfv2         | readings_p2026_08 |              3 |       114688 |       65536 |        8192 |        40960 | 112 kB   | 64 kB   | 8192 bytes | 40 kB
  2768585 | gfv2         | readings_p2026_11 |              1 |       114688 |       65536 |        8192 |        40960 | 112 kB   | 64 kB   | 8192 bytes | 40 kB
 11638160 | gfv2         | readings_p2027_07 |              1 |       114688 |       65536 |        8192 |        40960 | 112 kB   | 64 kB   | 8192 bytes | 40 kB
  3045382 | gfv2         | readings_p2027_09 |              0 |        90112 |       65536 |        8192 |        16384 | 88 kB    | 64 kB   | 8192 bytes | 16 kB
  3209204 | gfv2         | readings_p2027_10 |              1 |       114688 |       65536 |        8192 |        40960 | 112 kB   | 64 kB   | 8192 bytes | 40 kB
  3704388 | gfv2         | readings_p2027_11 |              1 |       114688 |       65536 |        8192 |        40960 | 112 kB   | 64 kB   | 8192 bytes | 40 kB
  3912885 | gfv2         | readings_p2027_12 |              1 |       114688 |       65536 |        8192 |        40960 | 112 kB   | 64 kB   | 8192 bytes | 40 kB
 13194652 | gfv2         | readings_p2030_03 |              0 |        40960 |       32768 |        8192 |            0 | 40 kB    | 32 kB   | 8192 bytes | 0 bytes
 13194682 | gfv2         | readings_p2030_04 |              0 |        40960 |       32768 |        8192 |            0 | 40 kB    | 32 kB   | 8192 bytes | 0 bytes
 13194722 | gfv2         | readings_p2030_05 |              0 |        40960 |       32768 |        8192 |            0 | 40 kB    | 32 kB   | 8192 bytes | 0 bytes
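
For reference, a listing with these columns can be produced by a query along these lines (a sketch based on the common pg_class/pg_size_pretty disk-usage query, not necessarily the exact one used here):

SELECT c.oid,
       n.nspname   AS table_schema,
       c.relname   AS table_name,
       c.reltuples AS row_estimate,
       pg_total_relation_size(c.oid) AS total_bytes,
       pg_indexes_size(c.oid)        AS index_bytes,
       coalesce(pg_total_relation_size(nullif(c.reltoastrelid, 0)), 0) AS toast_bytes,
       pg_total_relation_size(c.oid) - pg_indexes_size(c.oid)
         - coalesce(pg_total_relation_size(nullif(c.reltoastrelid, 0)), 0) AS table_bytes,
       pg_size_pretty(pg_total_relation_size(c.oid)) AS total,
       pg_size_pretty(pg_indexes_size(c.oid))        AS "index",
       pg_size_pretty(coalesce(pg_total_relation_size(nullif(c.reltoastrelid, 0)), 0)) AS toast,
       pg_size_pretty(pg_total_relation_size(c.oid) - pg_indexes_size(c.oid)
         - coalesce(pg_total_relation_size(nullif(c.reltoastrelid, 0)), 0)) AS "table"
  FROM pg_class c
  JOIN pg_namespace n ON n.oid = c.relnamespace
 WHERE n.nspname = 'gfv2'
   AND c.relname LIKE 'readings%'   -- filter is illustrative; the listing above was trimmed
   AND c.relkind = 'r'
 ORDER BY c.relname;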

When calling run_maintenance_proc(0, true, false, true), we saw:

NOTICE:  v_sql run_maintenance_proc: SELECT partman.run_maintenance('gfv2.readings', p_jobmon := 'f', p_debug := 't', p_analyze := 't')
NOTICE:  run_maint: v_partition_expression: "timestamp"
NOTICE:  run_maint: parent_table: gfv2.readings, v_last_partition: readings_p2030_05
NOTICE:  run_maint: v_current_partition_timestamp: 2021-03-24 20:37:49.686536+00, v_max_time_default: <NULL>
NOTICE:  run_maint before loop: current_partition_timestamp: 2021-03-24 20:37:49.686536+00, v_premade_count: 111, v_sub_timestamp_min: <NULL>, v_sub_timestamp_max: <NULL>

No new tables were being created. As an experiment, I moved all the data from those future partitions (>2021_03) into a temp table, deleted the future partitions, and reinserted the data into the parent table. When calling run_maintenance_proc after that, it correctly created the new tables (we have premake=4):

 11929131 | gfv2         | readings_p2021_03 | 1.03510456e+08 |  95011717120 | 13588316160 |    23355392 |  81400045568 | 88 GB    | 13 GB   | 22 MB      | 76 GB
 18334176 | gfv2         | readings_p2021_04 |              0 |        40960 |       32768 |        8192 |           0 | 40 kB    | 32 kB   | 8192 bytes | 0 bytes
 18334219 | gfv2         | readings_p2021_05 |              0 |        81920 |       65536 |        8192 |        8192 | 80 kB    | 64 kB   | 8192 bytes | 8192 bytes
 18334263 | gfv2         | readings_p2021_06 |              0 |        40960 |       32768 |        8192 |           0 | 40 kB    | 32 kB   | 8192 bytes | 0 bytes
 18334309 | gfv2         | readings_p2021_07 |              0 |        40960 |       32768 |        8192 |           0 | 40 kB    | 32 kB   | 8192 bytes | 0 bytes

Since then, we've run partition_data_time against the parent to fill in the missing past partitions and clear out the parent table.
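
That backfill is a single function call; something along these lines (the batch count here is illustrative, not the exact value we used):

SELECT partman.partition_data_time('gfv2.readings', p_batch_count := 20);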

Looking into how run_maintenance works, it seems that it will only premake tables if the latest existing table is within the premake range. We suspect this behavior changed since our old version (we were on either 1.8.7 or 2.3.3, though those version numbers may not be exact), as we only noticed this after upgrading PG10->PG12 and pg_partman along with it. Unfortunately, the extension was not upgraded inside the DB at the same time, which may be why we have some gaps in the past (but that does not account for the findings above).

@josharrington
Author

It would be great if run_maintenance checked for the existence of each candidate child table rather than making assumptions based on existing future partitions.

@josharrington
Author

Additionally, we found that infinite_time_partitions did not change the behavior. Here is the config for the table:

parent_table               | gfv2.readings
control                    | timestamp
partition_type             | partman
partition_interval         | 1 mon
constraint_cols            |
premake                    | 4
optimize_trigger           | 4
optimize_constraint        | 4
epoch                      | none
inherit_fk                 | t
retention                  |
retention_schema           |
retention_keep_table       | t
retention_keep_index       | t
infinite_time_partitions   | t
datetime_string            | YYYY_MM
automatic_maintenance      | on
jobmon                     | t
sub_partition_set_full     | f
undo_in_progress           | f
trigger_exception_handling | f
upsert                     |
trigger_return_null        | t
template_table             |
publications               |
inherit_privileges         | t
constraint_valid           | t
subscription_refresh       |
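
For reference, this config is the row for gfv2.readings in partman.part_config; pulling it and toggling infinite_time_partitions is just a plain query/UPDATE along these lines:

SELECT * FROM partman.part_config WHERE parent_table = 'gfv2.readings';

UPDATE partman.part_config
   SET infinite_time_partitions = true
 WHERE parent_table = 'gfv2.readings';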

@keithf4
Collaborator

keithf4 commented Mar 25, 2021

So I'll explain how pg_partman works as of the latest version... Yes, this has changed compared to how things worked back in 1.x and 2.x. It was changed so that it works better, and in a more predictable way, for most cases.

Maintenance looks at the "newest" data that exists in the partition set, regardless of whether it's in the child tables or the parent/default. It checks both the latest child and the parent/default and takes the higher value.

See - https://github.com/pgpartman/pg_partman/blob/master/sql/functions/run_maintenance.sql#L228

It then makes the next partition based on that newest data. This was done specifically because of maintenance issues other users ran into when pg_partman made new tables based only on the child table values. Yes, if data that is far into the future gets inserted, that means any new partitions that are made will also be far into the future and you will have a gap. But that data will not be lost; it will go into either the parent or the default.
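
To make that concrete with plain SQL (just an illustration of the rule above, not what run_maintenance literally executes; the control column here is "timestamp"):

SELECT max("timestamp") FROM ONLY gfv2.readings;      -- newest value sitting in the parent
SELECT max("timestamp") FROM gfv2.readings_p2030_05;  -- newest value in the latest child

Per the explanation above, the higher of those is what new partitions get premade ahead of, so a handful of far-future rows (like the 2030 partitions in your listing) pin the "newest" point years ahead and nothing near the current date gets created.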

Because of this, it is highly recommended that you closely monitor data going into the parent/default. This is actually critically important for native partitioning, since excessive data in the parent can greatly affect maintenance performance (see #262 (comment)). As soon as you see any alerts for data going into the parent/default, that data should be handled ASAP: either discard it or partition it out as needed.

You can monitor for data in the parent/default by using the check_default() function. For trigger-based partitioning, it checks the parent table as well.
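
A minimal check, run on a schedule alongside maintenance:

SELECT * FROM partman.check_default();

Any rows coming back from that mean data is sitting in a parent/default somewhere and needs attention.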

If you end up having gaps due to far-future data, I did add a function to automatically fill those gaps. See partition_gap_fill().
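
For example, with the parent table from this issue:

SELECT partman.partition_gap_fill('gfv2.readings');

It takes the parent table, creates whatever child tables are missing in the sequence, and returns how many it made.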

If you know for a fact that you will be getting data far into the future within a specific time frame, please set your premake appropriately to handle your expected data range. PG12+ native partitioning handles larger numbers of child tables much more efficiently, so if the large child table count is causing issues for you, please plan to upgrade. Also, if you're still using trigger-based partitioning, it is highly recommended to migrate to native unless you're using a specific feature that is not yet supported. I have written several guides in the docs folder for doing migrations.
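
On the premake point, for example, to keep roughly a year of future partitions premade (the value here is purely illustrative):

UPDATE partman.part_config
   SET premake = 12
 WHERE parent_table = 'gfv2.readings';

The new value is picked up on the next maintenance run.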

The infinite_time_partitions option is only there to keep making new partitions even when no new data is being added to a time-based partition set. By default, no new partitions are made when no new data is coming in. This is how it works for ID partitioning (and cannot work any other way there), so I made the default behavior consistent for both time & id. If the premake is already caught up, then it's not going to make any new partitions until they are needed based on the current time, even with this set to true.

While it would be nice to have maintenance automatically check for gaps and fill them in, I likely won't be doing that, for a couple of reasons. First, I believe it could impact the performance of the maintenance process the more partition sets it is managing. Second, if you accidentally get far-future or far-past data that you didn't intend to have, you will end up with a bunch of child tables you don't need and will have to clean them up.

So, in summary, set your premake high enough to handle your expected data range and closely monitor for data going into the parent/default.

Hopefully that helps explain how things work and lets you address the issues you've encountered. If you still have any questions, please let me know.

@josharrington
Author

Thank you for this fantastic explanation. I was unaware of partition_gap_fill, and I think we may be able to work around our issues with it. I am sad to hear that the feature will not be included, but I understand your reasoning. Would including it behind a flag be acceptable?

@keithf4
Collaborator

keithf4 commented Mar 25, 2021

A flag would be nice, maybe. But I'm honestly trying to avoid any more configuration options unless they are absolutely necessary, since the config table is getting a bit excessive.

@josharrington
Author

Understood. Thank you for the help!
