Switch to a better dataset #2

colinsauze · 2019-07-16T11:05:59Z

The lesson should ideally use one dataset throughout. It's currently a bit of a mixture of gapminder, World Bank, handwritten digits and randomly generated data.

Suggestions from Carpentry Connect Manchester include:

Edinburgh cycle data: https://edinburghcyclehire.com/open-data
Possibly coupled with weather data: https://www.metoffice.gov.uk/research/climate/maps-and-data/data/haduk-grid/data-formats
Seattle cycling data: https://jakevdp.github.io/blog/2015/07/23/learning-seattles-work-habits-from-bicycle-counts/
Wine data https://archive.ics.uci.edu/ml/datasets/Wine
Titanic https://www.kaggle.com/c/titanic
Breast cancer data from sklearn
Kaggle competition datasets

vinisalazar · 2021-05-14T14:22:20Z

I visited some of these links, here are some quick impressions:

Edinburgh cycle data: https://edinburghcyclehire.com/open-data

Looks interesting, but it's mainly time- and location-based. I would probably favor a dataset with mostly count data, and maybe some categorical variables.

Possibly coupled with weather data: https://www.metoffice.gov.uk/research/climate/maps-and-data/data/haduk-grid/data-formats

This seems to be available in netCDF only. Although Python has excellent tools for dealing with the format, it seems like an unnecessary cognitive load.

Seattle cycling data: https://jakevdp.github.io/blog/2015/07/23/learning-seattles-work-habits-from-bicycle-counts/

Same as the Edinburgh data. The URL does not seem very maintainable. Also, I'd avoid presenting a third-party analysis (although the post is a really good one) at the start / setup of the lesson, as it may distract learners. It could perhaps be presented afterwards.

Wine data https://archive.ics.uci.edu/ml/datasets/Wine

I quite like this one, especially the fact that it is deposited in the UCI Machine Learning Repository, which is very well known and seems very stable. The only downsides are the lack of categorical variables and the lack of a header with column names in the raw data file.
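The missing header can be worked around by supplying column names when the file is read. A minimal sketch using only the standard library; the rows are made up for illustration (they are not real measurements), and the short column names are my own labels for the first four fields of the Wine data (class, alcohol, malic acid, ash):

```python
import csv
import io

# Assumption: names chosen here for illustration; the raw file carries no header.
COLUMNS = ["cultivar", "alcohol", "malic_acid", "ash"]

# Made-up rows standing in for the headerless raw file (not real measurements).
RAW = "1,14.2,1.7,2.4\n2,12.4,0.9,1.4\n3,12.9,1.4,2.3\n"


def read_headerless(text, columns):
    """Parse headerless CSV text into a list of dicts keyed by column name."""
    reader = csv.DictReader(io.StringIO(text), fieldnames=columns)
    return list(reader)


rows = read_headerless(RAW, COLUMNS)
print(rows[0]["alcohol"])  # values are still strings at this point
```

In the lesson itself, pandas `read_csv` with `header=None` and `names=...` would probably be the more natural way to do the same thing.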

Titanic https://www.kaggle.com/c/titanic

This dataset seems very appropriate, but I dislike having to accept the Kaggle Terms of Service in order to download it. It would be much nicer to simply have a URL or repository that can be downloaded with wget or some other tool. A second disadvantage is that it is already split into training and test datasets. I guess it would be nicer to have a 'full' dataset and introduce the concept of splitting it later in the lesson.
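To illustrate what splitting a 'full' dataset inside the lesson could look like, here is a rough standard-library sketch. The function name `split_rows` and the 25% test fraction are assumptions for illustration; in practice scikit-learn's `train_test_split` would do this job.

```python
import random


def split_rows(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    shuffled = list(rows)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Stand-in for a full dataset: 20 row identifiers.
full = list(range(20))
train, test = split_rows(full)
print(len(train), len(test))
```

Starting from a single file like this lets learners see why a held-out test set exists, rather than receiving one ready-made.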

Breast cancer data from sklearn

This is the dataset I like the most from that list. Being a biologist, I am inevitably biased towards using it :) . I also really like the fact that it is already built into scikit-learn.

Kaggle competition datasets

Same comment as the Titanic dataset. One that I really like is Palmer Penguins!

colinsauze · 2021-05-19T16:56:11Z

Thinking about the requirements for a dataset, it ideally needs to work with all of the following:

linear regression
logarithmic regression
clustering
(non deep learning) neural networks
unsupervised dimensionality reduction such as PCA or t-SNE
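One way to screen a candidate against this list is to run each method on it once. The sketch below does that with the Wine data, used only as a stand-in because it ships with scikit-learn; the choice of columns, cluster count and network size are arbitrary assumptions, and logarithmic regression is left out.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

# Stand-in dataset: 178 samples, 13 numeric features, 3 classes.
X, y = load_wine(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# Linear regression: predict the last feature from the first (arbitrary pair).
reg = LinearRegression().fit(X[:, [0]], X[:, 12])

# Clustering: k-means with one cluster per known class.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_scaled)

# Small (non deep learning) neural network classifier.
clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)
clf.fit(X_scaled, y)

# Unsupervised dimensionality reduction down to two components.
X_2d = PCA(n_components=2).fit_transform(X_scaled)

print(reg.coef_.shape, len(set(labels)), X_2d.shape)
```

This only shows that each method runs; whether the results are instructive for learners still has to be judged per dataset.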

Assuming the licensing permits, we can always redistribute the dataset along with this lesson (as is currently being done). This still lets us use the wget/curl method to download while having a stable URL.
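For learners without wget or curl, the same download could be done from Python. A minimal sketch, assuming a placeholder URL (`DATA_URL` is hypothetical, not the lesson's real location) and a hypothetical helper name `fetch`:

```python
from pathlib import Path
from urllib.request import urlretrieve

# Hypothetical placeholder; the real URL would point at the copy
# redistributed with the lesson.
DATA_URL = "https://example.org/lesson/data/dataset.csv"


def fetch(url, dest_dir="data"):
    """Download url into dest_dir unless the file is already there."""
    dest = Path(dest_dir) / url.rsplit("/", 1)[-1]
    if not dest.exists():
        dest.parent.mkdir(parents=True, exist_ok=True)
        urlretrieve(url, str(dest))
    return dest


# Usage (needs network access, so it is not run here):
# path = fetch(DATA_URL)
```

Skipping the download when the file already exists means the setup step can be re-run safely during a workshop.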

I also like the idea of the Palmer Penguins; it's being used in the introduction to deep learning lesson too, and I envisage that these two lessons should be complementary.

bkmgit · 2021-06-25T10:05:01Z

An interesting data set:

https://github.com/MedMNIST/MedMNIST
This could be used in the dimensionality reduction lesson to complement the MNIST dataset
WIP: Dimension Reduction Update #15

vinisalazar · 2021-06-25T14:23:04Z

That MedMNIST dataset is quite interesting indeed. However, after reflecting and some conversations with other members of the community, I would tend to avoid medical datasets (including the Breast Cancer data that I endorsed in a previous comment), as people can be sensitive to them.

colinsauze · 2021-06-28T17:01:35Z

In the long term I do wonder if there is a way we could have custom versions of this lesson using different datasets. Then a medical group could use a version with medical data and another group could use their own dataset. But this would add a lot of complexity and I think we've got a lot of much more basic problems to solve first.

Just to add another dataset into the list, there is a weather prediction dataset (https://github.com/florian-huber/weather_prediction_dataset) which is being used by the Deep Learning incubator lesson.

bkmgit · 2021-06-28T17:23:11Z

There are a number of example datasets used for educational purposes. Assuming the lesson will become part of Data Carpentry, one should expect at least a social science track, an ecology track, a genomics track and possibly a geospatial track. Astronomy, economics and image processing tracks are also in development.

Minor changes can be accommodated by selecting options when forking the repository to prepare a lesson, in the same way options are chosen to create a workshop website.

agitter mentioned this issue Aug 6, 2019

Machine learning carpentry carpentries-incubator/ml4bio-workshop#56

Open

colinsauze added the help wanted Extra attention is needed label Feb 22, 2021

colinsauze pushed a commit that referenced this issue Sep 25, 2024

Merge pull request #2 from mike-ivs/gh-pages

7d311dd

Merge new changes from Mikes repo
