Edge case with umap_embedding #6

Open
0xJustin opened this issue Jan 31, 2025 · 1 comment


0xJustin commented Jan 31, 2025

When the input matrix has only one row with nonzero variance and every other row has zero variance, the model doesn't fit.
The error is:
NotFittedError: This UMAP instance is not fitted yet.
raised on the line:
connected_vertices_mask = ~disconnected_vertices(reducer)
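
For reference, the shape of input that triggers this can be sketched with NumPy alone. The array values are made up for illustration; the variance filter is the same expression used in the code in this issue, and only a single row survives it.

```python
import numpy as np

# Hypothetical input: 6 samples, 4 features; only row 2 varies.
X = np.ones((6, 4))
X[2] = [0.1, 0.5, 0.9, 0.3]

# Rows with (near-)zero standard deviation are treated as constant.
nonconstant = ~np.isclose(X.std(axis=1), 0)

# A single row is left after filtering, which is the reported edge case.
print(int(nonconstant.sum()))  # 1
print(X[nonconstant].shape)    # (1, 4)
```
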
Proposed solution:


from typing import Optional, Tuple

import numpy as np
from umap import UMAP
from umap.utils import disconnected_vertices


def umap_embedding(
    X: np.ndarray,
    n_neighbors: int = 5,
    min_dist: float = 0.12,
    spread: float = 9.0,
    random_state: int = 42,
    n_components: int = 2,
    metric: str = "correlation",
    n_epochs: int = 1500,
    **kwargs,
) -> Tuple[np.ndarray, np.ndarray, Optional[UMAP]]:
    """
    Perform UMAP embedding on input data.

    Args:
        X: Input data with shape (n_samples, n_features).
        n_neighbors: Number of neighbors to consider for each point.
        min_dist: Minimum distance between points in the embedding space.
        spread: Determines how spread out all embedded points are overall.
        random_state: Random seed for reproducibility.
        n_components: Number of dimensions in the embedding space.
        metric: Distance metric to use.
        n_epochs: Number of training epochs for embedding optimization.
        **kwargs: Additional keyword arguments for UMAP.

    Returns:
        A tuple containing:
        - embedding: The UMAP embedding (n_samples, n_components). Rows that are
          constant or disconnected are NaN; all rows are NaN if fewer than two
          non-constant rows remain.
        - mask: Boolean mask (length n_samples) marking rows that had nonzero variance
          and, when UMAP was fitted, were connected in the UMAP graph.
        - reducer: The fitted UMAP object, or None if UMAP was skipped.

    Raises:
        ValueError: If n_components is too large relative to sample size.

    Note:
        This function handles reshaping of input data and removes constant rows.
    """
    if n_components > X.shape[0] - 2:
        raise ValueError(
            "n_components must be at least 2 smaller than the number of samples. "
            "See: https://github.com/lmcinnes/umap/issues/201"
        )

    if len(X.shape) > 2:
        # Flatten (n_samples, n_features_1, ...) → (n_samples, n_features)
        X = X.reshape(X.shape[0], -1)

    # Prepare an output array of NaNs.
    n_samples = X.shape[0]
    embedding = np.full((n_samples, n_components), np.nan)

    # Mask out rows that have zero (or near-zero) variance.
    mask = ~np.isclose(X.std(axis=1), 0)
    X_nonconst = X[mask]

    # If fewer than 2 rows remain, skip UMAP and return embedding of NaNs.
    if X_nonconst.shape[0] < 2:
        return embedding, mask, None

    # Fit UMAP
    reducer = UMAP(
        n_neighbors=n_neighbors,
        min_dist=min_dist,
        random_state=random_state,
        n_components=n_components,
        metric=metric,
        spread=spread,
        n_epochs=n_epochs,
        **kwargs,
    )
    _embedding = reducer.fit_transform(X_nonconst)

    # Remove any “disconnected” vertices UMAP couldn’t place
    # (e.g. if the graph is disjoint).
    connected_vertices_mask = ~disconnected_vertices(reducer)

    # Incorporate the connected-vertices mask into our existing mask.
    mask[mask] = connected_vertices_mask

    # Place the valid embeddings back into the final array.
    embedding[mask] = _embedding[connected_vertices_mask]

    return embedding, mask, reducer
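
The nested update `mask[mask] = ...` can be hard to read at first. The sketch below replays it on made-up values, with a hand-written `connected` array standing in for `~disconnected_vertices(reducer)` so that no UMAP fit is needed. All numbers are illustrative only.

```python
import numpy as np

# Row-level mask: True where the row had nonzero variance (made-up values).
mask = np.array([True, False, True, True, False])

# Stand-in for ~disconnected_vertices(reducer): one entry per kept row.
connected = np.array([True, False, True])

# Stand-in for the fitted embedding of the three kept rows.
_embedding = np.array([[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]])

# Write the per-kept-row result back into the full-length mask.
mask[mask] = connected

# Rows that were constant or disconnected keep their NaN placeholder.
embedding = np.full((mask.size, 2), np.nan)
embedding[mask] = _embedding[connected]

print(mask)
print(embedding)
```

After the update, only rows 0 and 3 remain flagged, and they receive the first and third embedding rows respectively.
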

lappalainenj (Contributor) commented

Thanks for reporting this! Would you mind creating a PR and adding a little unit test for this case?
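
A test for this case could look roughly like the sketch below. The import path of `umap_embedding` is not given in this thread, so the check takes the function as an argument. The helper name `check_single_nonconstant_row` and the input values are hypothetical, and the assertions follow the early-return path of the proposed code above (all-NaN embedding, `None` reducer).

```python
import numpy as np


def check_single_nonconstant_row(umap_embedding):
    """Assert the behavior the proposed fix describes for one non-constant row."""
    # Hypothetical input: 6 samples, 4 features; only row 0 varies.
    X = np.zeros((6, 4))
    X[0] = [1.0, 2.0, 3.0, 4.0]

    embedding, mask, reducer = umap_embedding(X, n_components=2)

    # UMAP is skipped, so every embedding entry stays NaN.
    assert embedding.shape == (6, 2)
    assert np.isnan(embedding).all()
    # Only the non-constant row is flagged in the mask.
    assert mask.tolist() == [True, False, False, False, False, False]
    # No reducer is returned when the fit is skipped.
    assert reducer is None
```

In a pytest module this would be wrapped in a `test_...` function that imports the real `umap_embedding` and passes it to the helper.
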
