uniform_erdos_renyi_hypergraph produces way too many edges #339

Closed
arashbm opened this issue Apr 19, 2023 · 11 comments · Fixed by #596
Comments

@arashbm

arashbm commented Apr 19, 2023

Unless I'm seriously mistaken as to how this is supposed to work, the number of edges produced by uniform_erdos_renyi_hypergraph is off by a factor of $m!$.

import math

import xgi
import matplotlib.pyplot as plt

p = 1e-6
n = 1000
m = 3
plt.hist(
    [len(xgi.uniform_erdos_renyi_hypergraph(n, m, p, p_type="prob").edges) for i in range(1000)],
    label="Num. of edges",
)
plt.axvline(math.comb(n, m) * p, label="Exp. number of edges")
plt.legend()
plt.show()

[Figure: histogram of the number of edges over 1000 samples, with a vertical line marking the expected number of edges $\binom{n}{m} p$]
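Concretely, with $n = 1000$, $m = 3$, $p = 10^{-6}$: the expected number of edges is $\binom{1000}{3}\,p \approx 166$, but one Bernoulli trial per ordered triple of distinct nodes gives $3!\,\binom{1000}{3}\,p \approx 997$ edges on average.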

Possible solutions: lower $q$ by a factor of $m!$, or even better, use the $n$-th combination instead of the $n$-th product when defining _index_to_edge.

@leotrs
Collaborator

leotrs commented Apr 20, 2023

Hi @arashbm, thanks for your report. We'll look into this shortly!

@maximelucas
Collaborator

Thanks @arashbm! We'll have a look.

We also plan to move that function to random, rename it fast_random_hypergraph, and have it deal with both uniform and non-uniform hypergraphs (see #321 (comment)).

In the meantime, you can instead use xgi.random_hypergraph() with the order argument, which does the same thing in a different way (see docs).
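For instance, a minimal usage sketch (the signature matches the one shown later in this thread; num_nodes and num_edges are standard XGI Hypergraph attributes):

import xgi

# One probability per order: each possible 3-node edge (order 2) is
# included independently with probability 0.01.
H = xgi.random_hypergraph(50, ps=[0.01], order=2)
print(H.num_nodes, H.num_edges)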

@arashbm
Author

arashbm commented Apr 21, 2023

@maximelucas The combinatorial explosion of the mask array makes it impossible to use in some scenarios, e.g.:

>>> import xgi
>>> xgi.random_hypergraph(1000, order=5, ps=[1e-12])  # => should create a network with ~1350 edges
numpy.core._exceptions._ArrayMemoryError: Unable to allocate 9.72 PiB for an array with shape (1368173298991500,) and data type float64
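(For context, that shape is exactly $\binom{1000}{6} = 1368173298991500$; at 8 bytes per float64 entry, the mask alone would need about 9.72 PiB.)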

But that also has an easy solution of not pre-allocating mask:

import xgi
import numpy as np
import itertools

def random_hypergraph(N, ps, order=None, seed=None):
    rng = np.random.default_rng(seed)

    if order is not None:
        if len(ps) != 1:
            raise ValueError("ps must contain a single element if order is an int")

    if (np.any(np.array(ps) < 0)) or (np.any(np.array(ps) > 1)):
        raise ValueError("All elements of ps must be between 0 and 1 included.")

    nodes = range(N)
    hyperedges = []

    for i, p in enumerate(ps):

        if order is not None:
            d = order
        else:
            d = i + 1  # order, ps[0] is prob of edges (d=1)

        # Iterate over candidate edges lazily instead of pre-allocating a
        # mask of size comb(N, d + 1).
        potential_edges = itertools.combinations(nodes, d + 1)
        edges_to_add = [e for e in potential_edges if rng.random() <= p]

        hyperedges += edges_to_add

    H = xgi.empty_hypergraph()
    H.add_nodes_from(nodes)
    H.add_edges_from(hyperedges)

    return H

This doesn't fail, but it still takes $O\left(\binom{N}{m}\right)$ time to run. The original algorithm (with geometric "jumps") is still superior.
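For reference, here is a minimal sketch of the geometric-jump idea (my illustration, not XGI's actual code): instead of one Bernoulli trial per index, draw the gaps between successive successes from a geometric distribution, so the cost scales with the number of edges generated rather than with $\binom{N}{m}$.

import numpy as np

def sample_success_indices(total, p, seed=None):
    """Yield the indices in [0, total) where independent Bernoulli(p)
    trials succeed, using geometric jumps (assumes 0 < p <= 1)."""
    rng = np.random.default_rng(seed)
    index = -1
    while True:
        index += rng.geometric(p)  # gap to the next success, always >= 1
        if index >= total:
            return
        yield index

Each yielded index can then be mapped to a hyperedge with an unranking function such as the _index_to_edge_comb posted below.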

@nwlandry
Collaborator

Hi @arashbm! Thanks for raising this issue! I added this function to XGI with my most recent arXiv paper (https://arxiv.org/abs/2302.13967, see the appendix). The problem is that we are iterating over the Cartesian product, not the unique combinations. I did this because the _index_to_edge function is much slower when getting a combination from an index instead of a product from an index. If the hypergraph is sparse enough, then it should be fine, though. I will try to dig up the combinations version of _index_to_edge and we can chat more!

@arashbm
Author

arashbm commented Apr 21, 2023

@nwlandry That makes total sense. My current WIP implementation is also based on the Cartesian product, as this helps alleviate the issue with fixed-size integers, but I have to reject any candidate edge (node combination) whose nodes are not in strictly increasing order, which makes for a lot of wasted CPU cycles.

Nice paper BTW. I quite liked the presentation in SIAM NS.

@nwlandry
Collaborator

nwlandry commented Apr 24, 2023

Thank you!! :)

Okay, I ran the following:

from scipy.special import comb
import random


def _index_to_edge_comb(i, n, k):
    """
    Return the i-th (1-indexed) combination of k numbers chosen from 0, 1, ..., n - 1.
    """
    c = []
    r = i
    j = -1
    for s in range(1, k + 1):
        cs = j + 1
        while r - comb(n - 1 - cs, k - s, exact=True) > 0:
            r -= comb(n - 1 - cs, k - s, exact=True)
            cs += 1
        c.append(cs)
        j = cs
    return c

def _index_to_edge(index, n, m):
    """Generate a hyperedge given an index in the list of possible edges.

    Parameters
    ----------
    index : int > 0
        The index of the hyperedge in the list of all possible hyperedges.
    n : int > 0
        The number of nodes
    m : int > 0
        The hyperedge size.

    Returns
    -------
    list
        The reconstructed hyperedge

    See Also
    --------
    _index_to_edge_partition

    References
    ----------
    https://stackoverflow.com/questions/53834707/element-at-index-in-itertools-product
    """
    # Interpret `index` as an m-digit base-n number; each digit is a node.
    return [(index // (n**r) % n) for r in range(m - 1, -1, -1)]

# Run timeit on the results:
n = 1000
m = 3

max_prod = n**m
max_comb = comb(n, m, exact=True)

%timeit _index_to_edge(random.randrange(max_prod), n, m)
%timeit _index_to_edge_comb(random.randrange(max_comb), n, m)

The results were 1.02 µs ± 5.64 ns per loop when iterating over the Cartesian product and 315 µs ± 6.63 µs per loop when iterating over unique combinations. Of course, there are a factor of $\approx m!$ fewer indices to loop over, as you previously mentioned, but even when accounting for this difference, iterating over the combinations is ~50x slower.
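As a quick sanity check (my addition, assuming the 1-based indexing used in _index_to_edge_comb above), the unranking agrees with the lexicographic order produced by itertools.combinations:

import itertools

n, k = 7, 3
for i, combo in enumerate(itertools.combinations(range(n), k), start=1):
    assert _index_to_edge_comb(i, n, k) == list(combo)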

@nwlandry
Collaborator

I think that, at the very least, the documentation could be improved to explain exactly what the probability applies to (i.e., the Cartesian product instead of combinations). Any other things that you would suggest?

@arashbm
Author

arashbm commented Apr 24, 2023

I think it's a problem of naming and expectations. In dyadic undirected Erdős–Rényi $G(n, p)$ networks, the edges $(i, j)$ and $(j, i)$ don't each get a separate Bernoulli trial with probability $p$. But if I use uniform_erdos_renyi_hypergraph(n, 2, p, p_type="prob"), I get twice as many edges as what I expect, $p \frac{n (n-1)}{2}$.

Of course one alternative is to just call it something else and add to the documentation and hope that that's enough distinction to avoid confusion.

Another alternative is to keep the Cartesian product as it is but divide $p$ by $m!$ before proceeding. This works, but has the downside of creating duplicate edges, a behaviour that is again inconsistent with what people would expect from something called erdos_renyi.

You could also keep the Cartesian product and $p$ as they are, but replace the len(set(...)) == m check with something more restrictive, e.g., reject all candidate edges except those whose nodes are in strictly ascending order (see the sketch below). This has the $m!$ overhead of extra loops, but no duplicates.
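A sketch of that acceptance test (my illustration, not XGI's implementation):

def is_canonical(edge):
    # Keep a candidate tuple only if its nodes are strictly increasing;
    # exactly one of the m! orderings of each node set passes, so no
    # edge is ever duplicated.
    return all(u < v for u, v in zip(edge, edge[1:]))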

I would say no amount of documentation is going to fix API usability problems. I'm not even sure renaming would create enough distinction. But in any case, I'm not in a position to say whether you should break compatibility or not.

@maximelucas
Collaborator

"The combinatorial explosion of the mask array makes it impossible to use in some scenarios"

"numpy.core._exceptions._ArrayMemoryError: Unable to allocate 9.72 PiB for an array with shape"

Ah that's a fair point @arashbm, thanks!

"But that also has an easy solution of not pre-allocating mask:"

Yeah, that's actually how we used to do it. We changed to using a numpy array to make it slightly faster, but I didn't think of the problem you mentioned. I'll change it back.

@maximelucas
Collaborator

I opened #361 related to this. Is there anything else for which this issue should remain open? @nwlandry

@nwlandry
Collaborator

nwlandry commented Nov 3, 2023

Yeah, I don't think I've addressed this yet, but I will prioritize getting this done.
