Kyeongpil / Dataset Public

Datasets I collected. I usually found these datasets in research papers.

Dataset

I found these datasets in research papers.

Vision

Classification, Recognition, or Generative

Coil-20

http://www.cs.columbia.edu/CAVE/software/softlib/coil-20.php
STL-10: Self-taught learning

https://cs.stanford.edu/~acoates/stl10/
MS COCO

http://mscoco.org/dataset/#overview
US Post Office Zip Code Data

https://web.stanford.edu/~hastie/StatLearnSparsity_files/DATA/zipcode.html
Google Conceptual Caption dataset

https://ai.google.com/research/ConceptualCaptions/download
Visual Storytelling Dataset (VIST)

http://visionandlanguage.net/VIST/
NVIDIA food Image classification

https://github.com/corona10/FoodDataSet
CIFAR-10, CIFAR-100

https://www.cs.toronto.edu/~kriz/cifar.html
Large-scale CelebFaces Attributes (CelebA) Dataset

http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html
Street View House Numbers (SVHN)

http://ufldl.stanford.edu/housenumbers/
MNIST

http://yann.lecun.com/exdb/mnist/
Facial Database

http://www.face-rec.org/databases/
Labeled Faces in the Wild

http://vis-www.cs.umass.edu/lfw/#download
Simple Vector Drawing Datasets

https://github.com/hardmaru/sketch-rnn-datasets
Places2 (scene photos and associated information)

http://places2.csail.mit.edu/download.html
Yelp dataset (restaurant information and photos)

https://www.yelp.com/dataset_challenge
DeepFashion

http://mmlab.ie.cuhk.edu.hk/projects/DeepFashion.html
Image to LaTeX (a dataset for converting images of formulas into LaTeX code)

https://zenodo.org/record/56198#.WTpQ73XyhPN
NIST Dataset (Fingerprint, Mugshot, OCR)

https://www.nist.gov/srd/nist-special-database-4
Biometrics Ideal Test dataset (iris, fingerprint, face, palmprint, handwriting, etc.; login required)

http://biometrics.idealtest.org/index.jsp
PASCAL 2012 Dataset (Classification & Detection)

http://host.robots.ox.ac.uk/pascal/VOC/voc2012/index.html#data
Flickr Image Dataset

http://www.robots.ox.ac.uk/~vgg/data/oxbuildings/flickr100k.html

http://hockenmaier.cs.illinois.edu/DenotationGraph/
Stanford dogs dataset

http://vision.stanford.edu/aditya86/ImageNetDogs/
CUB-200 dataset (birds)

http://www.vision.caltech.edu/visipedia/CUB-200-2011.html
Facial beauty score dataset

https://github.com/HCIILAB/SCUT-FBP5500-Database-Release
Tumblr GIF dataset

https://www.kaggle.com/raingo/tumblr-gif-description-dataset
Totally Looks Like dataset

https://sites.google.com/view/totally-looks-like-dataset
CASIA WebFace database

http://www.cbsr.ia.ac.cn/english/CASIA-WebFace-Database.html
Labeled Faces in the Wild Home

http://vis-www.cs.umass.edu/lfw/
Behance Artistic Media Dataset

https://bam-dataset.org/#explore
Handwriting database

http://www.fki.inf.unibe.ch/databases/iam-handwriting-database
ImageCLEF dataset - Cross language image retrieval task

https://www.imageclef.org/
Yale-b - The extended Yale face database

http://vision.ucsd.edu/~leekc/ExtYaleDatabase/ExtYaleB.html
Visual Relationship Detection dataset

Images Annotations
Visual Genome dataset

http://visualgenome.org/
Oxford-102 dataset (Flower)

http://www.robots.ox.ac.uk/~vgg/data/flowers/102/
UCSD Pedestrian dataset (video anomaly detection)

http://www.svcl.ucsd.edu/projects/anomaly/dataset.htm

Medical

Lung cancer dataset

https://luna.grand-challenge.org

https://www.kaggle.com/c/data-science-bowl-2017
Brain tumor dataset

http://braintumorsegmentation.org
Breast cancer dataset (kaggle)

https://www.kaggle.com/uciml/breast-cancer-wisconsin-data
The cancer image archive

http://www.cancerimagingarchive.net
Mammography dataset

http://marathon.csee.usf.edu/Mammography/Database.html
Bio Image Dataset @ IIIT Delhi

http://www.iab-rubric.org/resources.html
CAMELYON16 - metastasis detection in lymph nodes

https://camelyon16.grand-challenge.org/
CAMELYON17 Dataset

https://camelyon17.grand-challenge.org/

Video & Image Stream

YouTube-BoundingBoxes Dataset

https://research.google.com/youtube-bb/index.html
YouTube-8M Dataset

https://research.google.com/youtube8m/
The Kinetics Human Action Video Dataset

https://deepmind.com/research/open-source/open-source-datasets/kinetics/
Announcing AVA: A Finely Labeled Video Dataset for Human Action Understanding

https://research.googleblog.com/2017/10/announcing-ava-finely-labeled-video.html?m=1
Microsoft Kinect dataset

https://www.microsoft.com/en-us/download/details.aspx?id=52283

Text

Machine Translation

StatMT (datasets for tasks such as machine translation and summarization, provided as language pairs)

http://www.statmt.org/wmt14/translation-task.html

http://www.statmt.org/wmt15/translation-task.html

http://www.statmt.org/wmt16/translation-task.html

http://www.statmt.org/wmt17/translation-task.html
UN parallel Corpus

https://conferences.unite.un.org/UNCorpus
IWSLT Dataset (including TED Translation)

https://sites.google.com/site/iwsltevaluation2016/
The Stacks Project (pairs of the original algebraic geometry text and its LaTeX code?)

http://stacks.math.columbia.edu/
Google sentence compression (sentence data standardized by Google)

http://storage.googleapis.com/sentencecomp/compression-data.json
Annals of the Joseon Dynasty (Korean / Classical Chinese translation)

http://sillok.history.go.kr/main/main.do
OpenSubtitles

http://opus.nlpl.eu/OpenSubtitles2018.php

Categorical & Topic modeling

20 Newsgroups

http://qwone.com/~jason/20Newsgroups/
Reuters dataset

https://archive.ics.uci.edu/ml/datasets/reuters-21578+text+categorization+collection
SNLI (Stanford Natural Language Inference) dataset

https://nlp.stanford.edu/projects/snli/

Short text

Tweet data, a subset of TREC 2011 microblog track

http://trec.nist.gov/data/tweets/
Title data, including news titles with class labels from some news websites

http://www.sogou.com/al
Italy earthquake Twitter dataset

https://www.kaggle.com/blackecho/italy-earthquakes

Paraphrase

Paraphrase database

http://paraphrase.org/#/download

QA & Dialogue

bAbI dataset (Facebook Question Answering)

https://research.facebook.com/research/babi/
Question answering (cloze-style, fill-in-the-blank) pairs using CNN/Daily Mail articles

https://github.com/deepmind/rc-data
Stanford Question Answering Dataset

https://rajpurkar.github.io/SQuAD-explorer/
Korean SQuAD dataset (KorQuAD)

https://korquad.github.io/
RACE Reading Comprehension dataset

http://www.qizhexie.com/data/RACE_leaderboard
GLUE (General Language Understanding Evaluation) benchmark dataset

https://gluebenchmark.com/
ClueWeb12 dataset (information retrieval)

https://lemurproject.org/clueweb12/
CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning

http://cs.stanford.edu/people/jcjohns/clevr/
WikiReading dataset

https://github.com/google-research-datasets/wiki-reading
SEMPRE: Semantic Parsing with Execution

https://nlp.stanford.edu/software/sempre/
Dialogue system datasets

https://breakend.github.io/DialogDatasets/
WikiSQL dataset

https://github.com/salesforce/WikiSQL
SynthText dataset

http://www.robots.ox.ac.uk/%7Evgg/data/scenetext/
Cornell Movie dialogue corpus

http://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html

Word Embedding

Datasets used for Word2Vec (Wikipedia, WMT11, etc.)

https://code.google.com/archive/p/word2vec/
fastText pre-trained vector set

https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md

Sentiment Analysis

Stanford Sentiment Treebank (SST)

http://nlp.stanford.edu/sentiment/
Multi-Domain Sentiment Dataset

http://www.cs.jhu.edu/~mdredze/datasets/sentiment/
Visual sentiment ontology

http://www.ee.columbia.edu/ln/dvmm/vso/download/flickr_dataset.html
Radboud Faces Database (RaFD)

http://www.socsci.ru.nl:8180/RaFD2/RaFD?p=main
Aspect sentiment analysis with aspect category

https://github.com/hsqmlzno1/MGAN

Raw text

Common Crawl dataset

http://commoncrawl.org/the-data/

Sound

Nottingham music dataset

https://www-labs.iro.umontreal.ca/~lisa/deep/data/
A large-scale dataset of manually annotated audio events (Google research)

https://research.google.com/audioset/
Speech Command Dataset

https://research.googleblog.com/2017/08/launching-speech-commands-dataset.html
Mozilla DeepSpeech

https://github.com/mozilla/DeepSpeech

Knowledge Base

Freebase

https://datahub.io/ko_KR/dataset/freebase
WordNet

https://wordnet.princeton.edu/
Microsoft Concept Graph

https://concept.msra.cn/Home/Download
DBpedia Dataset

The DBpedia data set uses a large multi-domain ontology which has been derived from Wikipedia as well as localized versions of DBpedia in more than 100 languages.

http://wiki.dbpedia.org/services-resources/datasets/dbpedia-datasets
Yago

YAGO3 is a huge semantic knowledge base derived from Wikipedia, WordNet, and GeoNames.

https://datahub.io/ko_KR/dataset/yago
Google Knowledge graph API

https://developers.google.com/knowledge-graph/

Social Networks & Recommendation

AMiner - Datasets for social network Analysis

https://cn.aminer.org/data

https://cn.aminer.org/aminernetwork
Netflix Prize Data Set

http://academictorrents.com/details/9b13183dc4d60676b773c9e2cd6de5e5542cee9a
Paper bibliography datasets, author citation networks

https://aminer.org/citation

http://dblp.uni-trier.de/

http://www.cs.cornell.edu/projects/kddcup/datasets.html
Politics subreddit

http://snap.stanford.edu/data/politics_subreddit.tar.gz
Amazon dataset

http://snap.stanford.edu/data/amazon-meta.html

http://jmcauley.ucsd.edu/data/amazon/
Twitter Spammer network

http://twitter.mpi-sws.org/spam/
Twitter tweets

http://snap.stanford.edu/data/twitter7.html
Online reviews

http://snap.stanford.edu/data/#reviews
Rumor Detection dataset

https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/BFGAVZ
MovieLens

https://grouplens.org/datasets/movielens/
CiteULike

http://www.citeulike.org/faq/data.adp
LastFM - Music, network dataset

https://www.upf.edu/web/mtg/lastfm360k
Delicious Bookmarks (with other datasets)

https://grouplens.org/datasets/hetrec-2011/
Check-in dataset

https://sites.google.com/site/yangdingqi/home/foursquare-dataset
Social Event Detection 2014 dataset

http://mklab.iti.gr/project/sed2014

Fake news dataset

Kaggle fake news dataset

https://www.kaggle.com/mrisdal/fake-news
Facebook fact check

https://github.com/BuzzFeedNews/2016-10-facebook-fact-check
FakeNewsNet dataset

https://github.com/KaiDMML/FakeNewsNet
FakeNews corpus

https://github.com/several27/FakeNewsCorpus
Liar dataset

https://www.cs.ucsb.edu/~william/data/liar_dataset.zip

Pre-trained Model

Word2Vec

https://code.google.com/archive/p/word2vec/
GloVe

https://nlp.stanford.edu/projects/glove/
FastText

https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md

Dataset Repo

Harvard Dataverse

https://dataverse.harvard.edu/

Korean Datasets

SKT Bigdata hub

https://www.bigdatahub.co.kr/index.do
ETRI corpus

http://aiopen.etri.re.kr/service_corpus.php

ETC.

Titanic survivors dataset

https://goo.gl/P9CMFY
Obama’s political speeches

https://github.com/samim23/obama-rnn
Yahoo Finance dataset

https://finance.yahoo.com/quote/GOOG/history?ltr=1
Linux code

https://github.com/torvalds/linux
NYC Taxi dataset

http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml
US Census dataset

https://www.census.gov/topics/income-poverty/income/data/datasets.html
