Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FEA: create sample datasets #4

Open
garyfeng opened this issue Nov 25, 2022 · 4 comments
Open

FEA: create sample datasets #4

garyfeng opened this issue Nov 25, 2022 · 4 comments
Assignees
Labels
enhancement New feature or request

Comments

@garyfeng
Copy link
Owner

Methods:

  1. random generation of the data, and targets (add noises to selected vectors)
  2. generate voice data myself using open and proprietary datasets
  3. use sample data from https://github.com/erikbern/ann-benchmarks:
Dataset Dimensions Train size Test size Neighbors Distance Download
DEEP1B 96 9,990,000 10,000 100 Angular HDF5 (3.6GB)
Fashion-MNIST 784 60,000 10,000 100 Euclidean HDF5 (217MB)
GIST 960 1,000,000 1,000 100 Euclidean HDF5 (3.6GB)
GloVe 25 1,183,514 10,000 100 Angular HDF5 (121MB)
GloVe 50 1,183,514 10,000 100 Angular HDF5 (235MB)
GloVe 100 1,183,514 10,000 100 Angular HDF5 (463MB)
GloVe 200 1,183,514 10,000 100 Angular HDF5 (918MB)
Kosarak 27,983 74,962 500 100 Jaccard HDF5 (33MB)
MNIST 784 60,000 10,000 100 Euclidean HDF5 (217MB)
MovieLens-10M 65,134 69,363 500 100 Jaccard HDF5 (63MB)
NYTimes 256 290,000 10,000 100 Angular HDF5 (301MB)
SIFT 128 1,000,000 10,000 100 Euclidean HDF5 (501MB)
Last.fm 65 292,385 50,000 100 Angular HDF5 (135MB)
@garyfeng garyfeng added the enhancement New feature or request label Nov 25, 2022
@garyfeng garyfeng self-assigned this Nov 25, 2022
@garyfeng
Copy link
Owner Author

garyfeng commented Nov 25, 2022

downloading http://ann-benchmarks.com/nytimes-256-angular.hdf5, as it has 256 dimensions and 300k vectors, roughly in the ballpark of what we are looking for. Note you have to paste the above link to a new page, or use wget to download. Clicking the link doesn't work.

@garyfeng
Copy link
Owner Author

reading HDF5 vector files:

https://github.com/erikbern/ann-benchmarks/blob/14df47df72bb1deee5d9e68616ba03edf65d4680/ann_benchmarks/datasets.py#L25-L42

def get_dataset(which):
    hdf5_fn = get_dataset_fn(which)
    try:
        url = 'http://ann-benchmarks.com/%s.hdf5' % which
        download(url, hdf5_fn)
    except:
        print("Cannot download %s" % url)
        if which in DATASETS:
            print("Creating dataset locally")
            DATASETS[which](hdf5_fn)
    hdf5_f = h5py.File(hdf5_fn, 'r')

    # here for backward compatibility, to ensure old datasets can still be used with newer versions
    # cast to integer because the json parser (later on) cannot interpret numpy integers
    dimension = int(hdf5_f.attrs['dimension']) if 'dimension' in hdf5_f.attrs else len(hdf5_f['train'][0])

    return hdf5_f, dimension

@garyfeng
Copy link
Owner Author

for random samples, use the random_float() function:

https://github.com/erikbern/ann-benchmarks/blob/14df47df72bb1deee5d9e68616ba03edf65d4680/ann_benchmarks/datasets.py#L292

See also how they handled mixing real and fake data:

DATASETS = {
    'deep-image-96-angular': deep_image,
    'fashion-mnist-784-euclidean': fashion_mnist,
    'gist-960-euclidean': gist,
    'glove-25-angular': lambda out_fn: glove(out_fn, 25),
    'glove-50-angular': lambda out_fn: glove(out_fn, 50),
    'glove-100-angular': lambda out_fn: glove(out_fn, 100),
    'glove-200-angular': lambda out_fn: glove(out_fn, 200),
    'mnist-784-euclidean': mnist,
    'random-xs-20-euclidean': lambda out_fn: random_float(out_fn, 20, 10000, 100,
                                                    'euclidean'),
    'random-s-100-euclidean': lambda out_fn: random_float(out_fn, 100, 100000, 1000,
                                                    'euclidean'),
    'random-xs-20-angular': lambda out_fn: random_float(out_fn, 20, 10000, 100,
                                                  'angular'),
    'random-s-100-angular': lambda out_fn: random_float(out_fn, 100, 100000, 1000,
                                                  'angular'),
    'random-xs-16-hamming': lambda out_fn: random_bitstring(out_fn, 16, 10000,
                                                            100),
    'random-s-128-hamming': lambda out_fn: random_bitstring(out_fn, 128,
                                                            50000, 1000),
    'random-l-256-hamming': lambda out_fn: random_bitstring(out_fn, 256,
                                                            100000, 1000),
    'random-s-jaccard': lambda out_fn: random_jaccard(out_fn, n=10000,
                                                       size=20, universe=40),
    'random-l-jaccard': lambda out_fn: random_jaccard(out_fn, n=100000,
                                                       size=70, universe=100),
    'sift-128-euclidean': sift,
    'nytimes-256-angular': lambda out_fn: nytimes(out_fn, 256),
    'nytimes-16-angular': lambda out_fn: nytimes(out_fn, 16),
    'word2bits-800-hamming': lambda out_fn: word2bits(
        out_fn, '400K',
        'w2b_bitlevel1_size800_vocab400K'),
    'lastfm-64-dot': lambda out_fn: lastfm(out_fn, 64),
    'sift-256-hamming': lambda out_fn: sift_hamming(
        out_fn, 'sift.hamming.256'),
    'kosarak-jaccard': lambda out_fn: kosarak(out_fn),
    'movielens1m-jaccard': movielens1m,
    'movielens10m-jaccard': movielens10m,
    'movielens20m-jaccard': movielens20m,
}

@garyfeng
Copy link
Owner Author

For another example of creating random vectors, see https://github.com/milvus-io/bootcamp/blob/master/solutions/image/face_recognition_system/quick_deploy/server/src/example.py to use with Milvus.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

1 participant