Author: Yuqing Lu
Database: KNB
Language: R
Contents:
-
notebooks:
- knb-notebook1.md: downloading the API and understanding its usage
- knb-notebook2.md: explore the database using their API
- knb-notebook3.md: find the most popular headers in the KNB database and download the datasets that have these headers
- knb-process.md: understand PISCO datasets and analyze their attributes
- knb-location.md: extract location and date information from PISCO, species and sea star wasting syndrome datasets and merge them
- knb-sckat.md: merge species count and size data and PISCO datasets in order to find a relation between count and size of species and temperature
-
data:
- generated:
- knb-attrs.csv: Extracted from the KNB website, around 700,000 rows. Each row contains information of an individual dataset that has header information in its metadata file.
- knb-pop-attrs.csv: Common headers of the datasets, ordered by their frequencies.
- pisco-locations-dates.csv: Group PISCO datasets by location and date.
- PISCOwSeason.csv: Add season column to PISCO location and date datasets.
- [ca_sea_star_vs_pisco.csv]
- []
- [pop_ds.csv]
- downloaded:
- seastarkat_size_count_totals_download.csv: Species count and size data (sea stars and Katharina only) requested from MARINe.
- sswd_sea_star_observations_2019_0411.csv: Sea star wasting syndrome data requested from MARINe.
- phototranraw_download.csv: requested from MARINe.
- downloaded PISCO csv files: too large to host online; stored on a hard drive

This repo has the data, code and reports for my exploratory analysis of KNB, a website that aggregates ecology-related datasets.
To access the data in KNB programmatically, I downloaded their API (notebook1). Then, following Ciera's suggestion and with the help of the KNB staff, I found the most popular headers in their database (notebook3). After exploring the headers, I decided to work on the datasets from PISCO. I first need to combine all of these datasets, which total around 200GB.
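
As a rough illustration of the "most popular headers" step, here is a minimal base R sketch. It assumes a long table with one row per dataset/header pair; the column names `dataset_id` and `header` and the toy values are made up for this example and may not match the real layout of knb-attrs.csv.

```r
# Hypothetical stand-in for the extracted header table.
# Column names and values are assumptions for illustration only.
attrs <- data.frame(
  dataset_id = c("a", "a", "b", "b", "c"),
  header     = c("date", "temp_c", "date", "site", "date"),
  stringsAsFactors = FALSE
)

# Drop duplicate (dataset, header) pairs so each dataset counts once per header,
# then count datasets per header and sort from most to least frequent.
header_counts <- sort(table(unique(attrs)$header), decreasing = TRUE)

pop_attrs <- data.frame(
  header     = names(header_counts),
  n_datasets = as.integer(header_counts),
  stringsAsFactors = FALSE
)

print(pop_attrs)
```

The resulting `pop_attrs` table plays the role that knb-pop-attrs.csv plays in this repo: headers ordered by how many datasets use them.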
In notebook3, I downloaded the XML metadata file and data frame for one PISCO dataset. From the metadata file I learned the general information about that dataset, including its purpose, location, organization, and attribute definitions. I then plotted a few of its attributes to see how each one varies over time.
In knb-location.md and knb-sckat.md, I aim to merge the species and PISCO data by location and time in order to look for a relation between species and ocean temperature.
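
The merge by location and time could look roughly like the base R sketch below. The tables, column names (`site`, `date`, `temp_c`, `count`) and values are invented for illustration, and the month-to-season mapping is an assumption (Northern Hemisphere meteorological seasons), not necessarily the definition used for PISCOwSeason.csv.

```r
# Toy tables; all column names and values are made up for illustration.
pisco <- data.frame(
  site   = c("A", "A", "B"),
  date   = as.Date(c("2015-01-10", "2015-07-04", "2015-07-04")),
  temp_c = c(11.2, 14.8, 15.1)
)
species <- data.frame(
  site  = c("A", "B", "B"),
  date  = as.Date(c("2015-07-04", "2015-07-04", "2015-10-01")),
  count = c(12L, 7L, 3L)
)

# Inner join: keep only site/date combinations present in both tables.
merged <- merge(species, pisco, by = c("site", "date"))

# Derive a season from the month number (assumed mapping, see note above).
season_by_month <- c("winter", "winter", "spring", "spring", "spring", "summer",
                     "summer", "summer", "fall", "fall", "fall", "winter")
month_num <- as.integer(format(merged$date, "%m"))
merged$season <- season_by_month[month_num]

print(merged)
```

An inner join drops observations that have no counterpart in the other table; passing `all.x = TRUE` to `merge()` would instead keep every species row and leave the temperature as `NA` where no PISCO record matches.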