uniform_erdos_renyi_hypergraph produces way too many edges #339

Closed
arashbm opened this issue Apr 19, 2023 · 11 comments · Fixed by #596
Comments

@arashbm

arashbm commented Apr 19, 2023

Unless I'm seriously mistaken as to how this is supposed to work, the number of edges produced by uniform_erdos_renyi_hypergraph is off by a factor of $m!$.

import math

import xgi
import matplotlib.pyplot as plt

p = 1e-6
n = 1000
m = 3
plt.hist(
    [len(xgi.uniform_erdos_renyi_hypergraph(n, m, p, p_type="prob").edges) for i in range(1000)],
    label="Num. of edges",
)
plt.axvline(math.comb(n, m) * p, label="Exp. number of edges")
plt.legend()
plt.show()

[Figure: histogram of the number of edges over 1000 samples, with a vertical line marking the expected number of edges $\binom{n}{m} p$]
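Concretely, with $n = 1000$, $m = 3$, $p = 10^{-6}$: the expected number of edges is $\binom{1000}{3}\,p \approx 166$, but one Bernoulli trial per ordered triple of distinct nodes gives $3!\,\binom{1000}{3}\,p \approx 997$ edges on average.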

Possible solutions: lower $q$ by a factor of $m!$, or even better, use the $n$-th combination instead of the $n$-th product when defining _index_to_edge.

@leotrs
Collaborator

leotrs commented Apr 20, 2023

Hi @arashbm, thanks for your report. We'll look into this shortly!

@maximelucas
Collaborator

Thanks @arashbm! We'll have a look.

We also plan to move that function to random, rename it fast_random_hypergraph, and have it deal with both uniform and non-uniform hypergraphs (see #321 (comment)).

In the meantime, you can instead use xgi.random_hypergraph() with the order argument, which does the same thing in a different way (see docs).
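For instance, a minimal usage sketch (the signature matches the one shown later in this thread; num_nodes and num_edges are standard XGI Hypergraph attributes):

import xgi

# One probability per order: each possible 3-node edge (order 2) is
# included independently with probability 0.01.
H = xgi.random_hypergraph(50, ps=[0.01], order=2)
print(H.num_nodes, H.num_edges)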

@arashbm
Author

arashbm commented Apr 21, 2023

@maximelucas The combinatorial explosion of the mask array makes it impossible to use in some scenarios, e.g.:

>>> import xgi
>>> xgi.random_hypergraph(1000, order=5, ps=[1e-12])  # => should create a network with ~1350 edges
numpy.core._exceptions._ArrayMemoryError: Unable to allocate 9.72 PiB for an array with shape (1368173298991500,) and data type float64
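(For context, that shape is exactly $\binom{1000}{6} = 1368173298991500$; at 8 bytes per float64 entry, the mask alone would need about 9.72 PiB.)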

But that also has an easy solution of not pre-allocating mask:

import xgi
import numpy as np
import itertools

def random_hypergraph(N, ps, order=None, seed=None):
    rng = np.random.default_rng(seed)

    if order is not None:
        if len(ps) != 1:
            raise ValueError("ps must contain a single element if order is an int")

    if (np.any(np.array(ps) < 0)) or (np.any(np.array(ps) > 1)):
        raise ValueError("All elements of ps must be between 0 and 1 included.")

    nodes = range(N)
    hyperedges = []

    for i, p in enumerate(ps):

        if order is not None:
            d = order
        else:
            d = i + 1  # order, ps[0] is prob of edges (d=1)

        # Iterate over candidate edges lazily instead of pre-allocating a
        # mask of size comb(N, d + 1).
        potential_edges = itertools.combinations(nodes, d + 1)
        edges_to_add = [e for e in potential_edges if rng.random() <= p]

        hyperedges += edges_to_add

    H = xgi.empty_hypergraph()
    H.add_nodes_from(nodes)
    H.add_edges_from(hyperedges)

    return H

This doesn't fail, but it still takes $O\left(\binom{N}{m}\right)$ time to run. The original algorithm (with geometric "jumps") is still superior.
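For reference, here is a minimal sketch of the geometric-jump idea (my illustration, not XGI's actual code): instead of one Bernoulli trial per index, draw the gaps between successive successes from a geometric distribution, so the cost scales with the number of edges generated rather than with $\binom{N}{m}$.

import numpy as np

def sample_success_indices(total, p, seed=None):
    """Yield the indices in [0, total) where independent Bernoulli(p)
    trials succeed, using geometric jumps (assumes 0 < p <= 1)."""
    rng = np.random.default_rng(seed)
    index = -1
    while True:
        index += rng.geometric(p)  # gap to the next success, always >= 1
        if index >= total:
            return
        yield index

Each yielded index can then be mapped to a hyperedge with an unranking function such as the _index_to_edge_comb posted below.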

@nwlandry
Collaborator

Hi @arashbm! Thanks for raising this issue! I added this function to XGI with my most recent arXiv paper (https://arxiv.org/abs/2302.13967, see the appendix). The problem is that we are iterating over the Cartesian product, not the unique combinations. I did this because the _index_to_edge function is much slower when getting a combination from an index instead of a product from an index. If the hypergraph is sparse enough, then it should be fine, though. I will try to dig up the combinations version of _index_to_edge and we can chat more!

@arashbm
Author

arashbm commented Apr 21, 2023

@nwlandry That makes total sense. My current WIP implementation is also based on the Cartesian product, as this helps alleviate the issue with fixed-size integers, but I have to reject any candidate edge (node combination) whose nodes are not in strictly increasing order, which makes for a lot of wasted CPU cycles.

Nice paper BTW. I quite liked the presentation in SIAM NS.

@nwlandry
Collaborator

nwlandry commented Apr 24, 2023

Thank you!! :)

Okay, I ran the following:

from scipy.special import comb
import random


def _index_to_edge_comb(i, n, k):
    """
    Return the i-th (1-indexed) combination of k numbers chosen from 0, 1, ..., n - 1.
    """
    c = []
    r = i
    j = -1
    for s in range(1, k + 1):
        cs = j + 1
        while r - comb(n - 1 - cs, k - s, exact=True) > 0:
            r -= comb(n - 1 - cs, k - s, exact=True)
            cs += 1
        c.append(cs)
        j = cs
    return c

def _index_to_edge(index, n, m):
    """Generate a hyperedge given an index in the list of possible edges.

    Parameters
    ----------
    index : int > 0
        The index of the hyperedge in the list of all possible hyperedges.
    n : int > 0
        The number of nodes
    m : int > 0
        The hyperedge size.

    Returns
    -------
    list
        The reconstructed hyperedge

    See Also
    --------
    _index_to_edge_partition

    References
    ----------
    https://stackoverflow.com/questions/53834707/element-at-index-in-itertools-product
    """
    # Interpret `index` as an m-digit base-n number; each digit is a node.
    return [(index // (n**r) % n) for r in range(m - 1, -1, -1)]

# Run timeit on the results:
n = 1000
m = 3

max_prod = n**m
max_comb = comb(n, m, exact=True)

%timeit _index_to_edge(random.randrange(max_prod), n, m)
%timeit _index_to_edge_comb(random.randrange(max_comb), n, m)

The results were 1.02 µs ± 5.64 ns per loop when iterating over the Cartesian product and 315 µs ± 6.63 µs per loop when iterating over unique combinations. Of course, there are a factor of $\approx m!$ fewer indices to loop over, as you previously mentioned, but even when accounting for this difference, iterating over the combinations is ~50x slower.
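As a quick sanity check (my addition, assuming the 1-based indexing used in _index_to_edge_comb above), the unranking agrees with the lexicographic order produced by itertools.combinations:

import itertools

n, k = 7, 3
for i, combo in enumerate(itertools.combinations(range(n), k), start=1):
    assert _index_to_edge_comb(i, n, k) == list(combo)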

@nwlandry
Collaborator

I think that, at the very least, the documentation could be improved to explain exactly what the probability applies to (i.e., the Cartesian product instead of combinations). Any other things that you would suggest?

@arashbm
Author

arashbm commented Apr 24, 2023

I think it's a problem of naming and expectations. In dyadic undirected Erdős–Rényi $G(n, p)$ networks, the edges $(i, j)$ and $(j, i)$ don't each get a separate Bernoulli trial with probability $p$. But if I use uniform_erdos_renyi_hypergraph(n, 2, p, p_type="prob"), I get twice as many edges as what I expect, $p \frac{n (n-1)}{2}$.

Of course one alternative is to just call it something else and add to the documentation and hope that that's enough distinction to avoid confusion.

Another alternative is to keep the Cartesian product as it is but divide $p$ by $m!$ before proceeding. This works, but has the downside of creating duplicate edges, a behaviour that is again inconsistent with what people would expect from something called erdos_renyi.

You could also keep the Cartesian product and $p$ as they are, but replace the len(set(...)) == m check with something more restrictive, e.g., reject all candidate edges except those whose nodes are in strictly ascending order (see the sketch below). This has the $m!$ overhead of extra loops, but no duplicates.
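A sketch of that acceptance test (my illustration, not XGI's implementation):

def is_canonical(edge):
    # Keep a candidate tuple only if its nodes are strictly increasing;
    # exactly one of the m! orderings of each node set passes, so no
    # edge is ever duplicated.
    return all(u < v for u, v in zip(edge, edge[1:]))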

I would say no amount of documentation is going to fix API usability problems. I'm not even sure renaming would create enough distinction. But in any case, I'm not in a position to say whether you should break compatibility or not.

@maximelucas
Collaborator

"The combinatorial explosion of the mask array makes it impossible to use in some scenarios"

"numpy.core._exceptions._ArrayMemoryError: Unable to allocate 9.72 PiB for an array with shape"

Ah that's a fair point @arashbm, thanks!

"But that also has an easy solution of not pre-allocating mask:"

Yeah, that's actually how we used to do it. We changed to using a numpy array to make it slightly faster, but I didn't think of the problem you mentioned. I'll change it back.

@maximelucas
Collaborator

I opened #361 related to this. Is there anything else for which this issue should remain open? @nwlandry

@nwlandry
Collaborator

nwlandry commented Nov 3, 2023

Yeah, I don't think I've addressed this yet, but I will prioritize getting this done.
