
Division by zero in printCellStatistics() #6121

Closed
pservit opened this issue Sep 8, 2021 · 14 comments

pservit commented Sep 8, 2021

osrm-partition always fails when processing the planet after osrm-extract with the foot profile.

After some debugging and adding logs, I found that partition.GetNumberOfCells returns 0 for level=4, so a division by zero occurs in printCellStatistics() at line 42:

  source /= num_cells;
  destination /= num_cells;

Here is the full log from osrm-partition:

[info] Computing recursive bisection
[info] Loaded compressed node based graph: 736367212 edges, 1811657456 nodes
[info]  running partition: 128 1.2 0.25 10 1000 # max_cell_size balance boundary cuts small_component_size
[info] Found 1532677258 SCC (857 large, 1532676401 small)
[info] SCC run took: 93.7171s
[info] Full bisection done in 7846.82s
[info] Loaded node based graph to edge based graph mapping
[info] Loaded edge based graph for mapping partition ids: 2996357810 edges, 736366438 nodes
[info] Fixed 38099 unconnected nodes
[info] Edge-based-graph annotation:
[info]   level 1 #cells 2458529 bit size 22
[info]   level 2 #cells 258451 bit size 18
[info]   level 3 #cells 16310 bit size 14
[info]   level 4 #cells 534 bit size 10
[info] Renumbered data in 2297.4 seconds
[info] MultiLevelPartition constructed in 204.034 seconds
[info] CellStorage constructed in 48.4773 seconds
[info] MLD data writing took 64.3765 seconds
[info] Cells statistics per level
[info] Level 1
[info] num_cells 2458529
[info] source 42477944  num_cells 2458529
[info] destination 62028086  num_cells 2458529
[info] Level 1 #cells 2458529 #boundary nodes 62842547, sources: avg. 17, destinations: avg. 25, entries: 1581977611 (12655820888 bytes)
[info] Level 2
[info] num_cells 258451
[info] source 9323222  num_cells 258451
[info] destination 13481432  num_cells 258451
[info] Level 2 #cells 258451 #boundary nodes 13649358, sources: avg. 36, destinations: avg. 52, entries: 685047779 (5480382232 bytes)
[info] Level 3
[info] num_cells 16310
[info] source 1191758  num_cells 16310
[info] destination 1719894  num_cells 16310
[info] Level 3 #cells 16310 #boundary nodes 1751552, sources: avg. 73, destinations: avg. 105, entries: 191696027 (1533568216 bytes)
[info] Level 4
[info] num_cells 0
[info] source 0  num_cells 0
[info] destination 0  num_cells 0

-- fail --

I'm not familiar with osrm-backend internals. Is it normal behaviour to have zero cells on level 4 in MultiLevelPartition? (Note the [info] level 4 #cells 534 bit size 10 line in the edge-based-graph annotation stats.)

mjjbell commented Sep 8, 2021

@pservit just to confirm, do you get the same error when running osrm-partition?

pservit commented Sep 8, 2021

Sorry, the logs above are from osrm-partition.
I've edited my first message.

mjjbell commented Sep 8, 2021

[info] level 1 #cells 2458529 bit size 22
[info] level 2 #cells 258451 bit size 18
[info] level 3 #cells 16310 bit size 14
[info] level 4 #cells 534 bit size 10

Summing the bit sizes: 22 + 18 + 14 + 10 = 64.
This immediately leads me to believe it's an overflow issue. I'll try to reproduce.
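
As a rough illustration, here is a minimal standalone sketch of that arithmetic (not OSRM's actual mask-building code; the cell counts are taken from the annotation log above). Packing the per-level cell IDs into a single 64-bit partition ID leaves no spare bit:

#include <cmath>
#include <cstdint>
#include <iostream>
#include <vector>

int main()
{
    // #cells per level; each level needs ceil(log2(#cells)) bits of the partition ID.
    const std::vector<std::uint64_t> cells = {2458529, 258451, 16310, 534};

    std::uint32_t offset = 0;
    for (const auto num_cells : cells)
    {
        const auto bits =
            static_cast<std::uint32_t>(std::ceil(std::log2(static_cast<double>(num_cells))));
        std::cout << "needs " << bits << " bits, starts at bit offset " << offset << '\n';
        offset += bits; // offsets per level: 0, 22, 40, 54
    }

    // The offset following the last level lands exactly on 64.
    std::cout << "next offset after the last level: " << offset << '\n';
    return 0;
}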

mjjbell commented Sep 8, 2021

This test reproduces the issue

BOOST_AUTO_TEST_CASE(cell_bit_overflow)
{
    // Cell counts per level from the failing planet run; their bit widths sum to 64.
    std::vector<size_t> level_cells = {2458529, 258451, 16310, 534};
    std::vector<std::vector<CellID>> levels(level_cells.size(),
                                            std::vector<CellID>(level_cells[0]));
    std::vector<uint32_t> levels_to_num_cells(level_cells.size());

    // Assign node -> cell IDs so that each level gets the requested number of cells.
    auto level = 0;
    for (auto num_cells : level_cells)
    {
        for (auto val : util::irange<size_t>(0ULL, level_cells[0])) {
            levels[level][val] = std::min(val, num_cells - 1);
        }
        levels_to_num_cells[level] = num_cells;
        ++level;
    }

    MultiLevelPartition mlp{levels, levels_to_num_cells};

    BOOST_REQUIRE_EQUAL(mlp.GetNumberOfCells(1), level_cells[0]);
    BOOST_REQUIRE_EQUAL(mlp.GetNumberOfCells(2), level_cells[1]);
    BOOST_REQUIRE_EQUAL(mlp.GetNumberOfCells(3), level_cells[2]);
    BOOST_REQUIRE_EQUAL(mlp.GetNumberOfCells(4), level_cells[3]);
}

It fails with

fatal error: in "multi_level_partition_tests/cell_bit_overflow": critical check mlp.GetNumberOfCells(4) == level_cells[3] has failed [0 != 534]

In Debug builds, this test triggers an assertion when building the level masks:

BOOST_ASSERT(next_offset < NUM_PARTITION_BITS);

NUM_PARTITION_BITS is 64 unless your architecture is not using 8 bit chars.

This means there's an inconsistency with an earlier check. A runtime error should be thrown before the assertion is ever reached, but it is only thrown for > 64 bits. The comparison needs to be >= 64.

if (sum_bits > 64)
{
    throw util::exception(
        "Can't pack the partition information at level " + std::to_string(lidx) +
        " into a 64bit integer. Would require " + std::to_string(sum_bits) + " bits.");
}
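
For clarity, a sketch of the tightened comparison described above (only the condition changes; the fix that actually ships may differ in detail):

if (sum_bits >= 64)
{
    throw util::exception(
        "Can't pack the partition information at level " + std::to_string(lidx) +
        " into a 64bit integer. Would require " + std::to_string(sum_bits) + " bits.");
}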

mjjbell commented Sep 8, 2021

In summary, the OSM planet foot graph now appears to be too big for the OSRM MLD defaults.

I believe a workaround would be to reduce the number of cells at each level by increasing their maximum size.
You can do this by passing the --max-cell-sizes parameter to osrm-partition as a comma-separated list.
The default values are:

max_cell_sizes({128, 128 * 32, 128 * 32 * 16, 128 * 32 * 16 * 32})

pservit commented Sep 9, 2021

Thank you!

I tested osrm-partition with --max-cell-sizes=1024,16384,262144,4194304 and it completed without error:

[info] Edge-based-graph annotation:
[info]   level 1 #cells 387230 bit size 19
[info]   level 2 #cells 65352 bit size 16
[info]   level 3 #cells 4133 bit size 13
[info]   level 4 #cells 276 bit size 9

Now I'm waiting for osrm-customize.

pservit commented Sep 9, 2021

[4960757.125398] osrm-customize[3189345]: segfault at 7ef692eca1f0 ip 0000559704c1fb8d sp 00007ef23affaab0 error 6

gdb backtrace: backtrace.txt

pservit commented Sep 10, 2021

with debug symbols:

bt-all.txt
bt-full.txt

osrm-backend version 5.22.0

mjjbell commented Sep 10, 2021

Thanks for the additional debug logs, that really helped.

It's another overflow issue.

Each cell has source and destination nodes. MLD keeps a |source| x |destination| table of values for each metric (distances, durations, etc.) from every source to every destination in a cell.

It stores all of the values for a metric in one large array, with an offset for each cell to find its values.

ValueOffset value_offset = INVALID_VALUE_OFFSET;

The offset is... 32 bit
using ValueOffset = std::uint32_t;

Looking at the cell data from the backtrace:

cells = std::vector of length 456991, capacity 456991 = {{value_offset = 0, source_boundary_offset = 0, destination_boundary_offset = 0,
num_source_nodes = 5908884, num_destination_nodes = 9553299}, {value_offset = 580436988, source_boundary_offset = 5908884,
destination_boundary_offset = 9553299, num_source_nodes = 442, num_destination_nodes = 691}, {value_offset = 580742410,
source_boundary_offset = 5909326, destination_boundary_offset = 9553990, num_source_nodes = 704, num_destination_nodes = 1086}, {
value_offset = 581506954, source_boundary_offset = 5910030, destination_boundary_offset = 9555076, num_source_nodes = 498,
num_destination_nodes = 731}, {value_offset = 581870992, source_boundary_offset = 5910528, destination_boundary_offset = 9555807,

Immediately we can see the first cell has 5908884 sources and 9553299 destinations, which is suspiciously large given the cell sizes were limited to 1024 at this level.

5908884 x 9553299 % 2^32 = 580436988, which is the value_offset for the start of the next cell - we've immediately overflowed.
I believe this massive cell is a manifestation of #6084, which is fixed in master (#6085).

Nevertheless, with 456991 cells and a conservative estimate of 250 sources and 250 destinations per cell, an overflow will still occur.
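
To make the wraparound concrete, here is a small standalone sketch (the values are taken from the backtrace above; this is not OSRM code):

#include <cstdint>
#include <iostream>

int main()
{
    // First cell in the backtrace: 5908884 sources x 9553299 destinations.
    const std::uint64_t entries = 5908884ULL * 9553299ULL; // 56449335608316

    // Truncating that to a 32-bit ValueOffset wraps around to the offset recorded for the next cell.
    const auto wrapped = static_cast<std::uint32_t>(entries);
    std::cout << entries << " mod 2^32 = " << wrapped << '\n'; // prints 580436988
    return 0;
}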

mjjbell commented Sep 10, 2021

Applying #6085 will help reduce the size of this array, but you'll need to increase the Offset types to 64 bit and see how much further you get.

using ValueOffset = std::uint32_t;
using BoundaryOffset = std::uint32_t;

pservit commented Sep 13, 2021

With this patch, osrm-customize and osrm-routed work normally:

diff --git a/include/partitioner/cell_storage.hpp b/include/partitioner/cell_storage.hpp
index 42c347861..40ddc17ed 100644
--- a/include/partitioner/cell_storage.hpp
+++ b/include/partitioner/cell_storage.hpp
@@ -52,8 +52,8 @@ namespace detail
 template <storage::Ownership Ownership> class CellStorageImpl
 {
   public:
-    using ValueOffset = std::uint32_t;
-    using BoundaryOffset = std::uint32_t;
+    using ValueOffset = std::uint64_t;
+    using BoundaryOffset = std::uint64_t;
     using BoundarySize = std::uint32_t;
     using SourceIndex = std::uint32_t;
     using DestinationIndex = std::uint32_t;

Just some information:
the full pipeline (extract, partition, customize) takes 3 days to complete on a Hetzner AX61-NVMe (24 threads, 128 GB RAM, 2x1.92 TB NVMe in RAID 1, 512 GB swap).

mjjbell commented Sep 22, 2021

Fixed in v5.26.0 via #6123 and #6124

edumucelli commented May 11, 2022

Just some information: the full pipeline (extract, partition, customize) takes 3 days to complete on a Hetzner AX61-NVMe (24 threads, 128 GB RAM, 2x1.92 TB NVMe in RAID 1, 512 GB swap).

That is great information. How much memory does it need at runtime? Are you using mmap with osrm-routed?

pservit commented May 11, 2022

    PID CID           SYSCPU  USRCPU   VGROW   RGROW   RDDSK   WRDSK  RUID      EUID      ST  EXC   THR  S  CPUNR   CPU  CMD       1/22
 190784 a6181acd443b   2h57m   6d11h  261.6G   66.9G    1.4T      0K  root      root      N-    -    26  S     10    3%  osrm-routed
 191875 b70acc26ba0b  77m29s  53h42m  183.1G   55.5G  756.6G      0K  root      root      N-    -    26  S      8    1%  osrm-routed

not using mmap
