Merge branch 'release-0.2'
bbengfort committed Jan 27, 2016
2 parents c863e42 + bbddce4 commit ba108e8
Showing 27 changed files with 618 additions and 154 deletions.
33 changes: 32 additions & 1 deletion README.md
@@ -3,14 +3,17 @@

[![Build Status][travis_img]][travis_href]
[![Coverage Status][coveralls_img]][coverals_href]
[![Documentation Status][rtfd_img]][rtfd_href]
[![Stories in Ready][waffle_img]][waffle_href]

[![Colorful Wall](docs/img/wall.jpg)][wall.jpg]
[![Colorful Wall](docs/images/wall.jpg)][wall.jpg]

## About

This is a dataset management and visualization tool that is being built as part of the DDL Multidimensional Visualization Research Lab. See: [Parallel Coordinates](http://homes.cs.washington.edu/~jheer//files/zoo/ex/stats/parallel.html) for more on the types of visualizations we're experimenting with.

For more information, please enjoy the documentation found at [trinket.readthedocs.org](http://trinket.readthedocs.org/).

### Contributing

Trinket is open source, but because this is a District Data Labs project, we would appreciate it if you would let us know how you intend to use the software (other than simply copying and pasting code so that you can use it in your own projects). If you would like to contribute (especially if you are a student or research labs member at District Data Labs), you can do so in the following ways:
@@ -38,15 +41,43 @@ If you are a member of the District Data Labs Faculty group, you have direct acc

4. Repeat. Releases will be routinely pushed into master via release branches, then deployed to the server.

### Throughput

[![Throughput Graph](https://graphs.waffle.io/DistrictDataLabs/trinket/throughput.svg)](https://waffle.io/DistrictDataLabs/trinket/metrics)

### Attribution

The image used in this README, ["window#1"][wall.jpg] by [Namelas Frade](https://www.flickr.com/photos/zingh/), is licensed under [CC BY-NC-ND 2.0](https://creativecommons.org/licenses/by-nc-nd/2.0/).

## Changelog

The release versions that are sent to the Python Package Index (PyPI) are also tagged in GitHub. You can see the tags through the GitHub web application and download the tarball of the version you'd like. Additionally, PyPI will host the various releases of Trinket (eventually).

The versioning uses a three-part scheme, "a.b.c": "a" represents a major release that may not be backwards compatible; "b" is incremented on minor releases that may contain extra features but are backwards compatible; "c" releases are bug fixes or other micro changes that developers should feel free to update to immediately.

### Version 0.2

* **tag**: [v0.2](https://github.com/DistrictDataLabs/trinket/releases/tag/v0.2)
* **deployment**: Wednesday, January 27, 2016
* **commit**: (see tag)

This minor update gave a bit more functionality to the MVP prototype, even though the version was intended to have a much more impactful feature set. However, after some study the workflow is changing, so this development branch is being pruned and deployed in preparation for the next batch of work. The major achievements of this version are the documentation that discusses our approach and the dataset search and listing page that is now available.

### Version 0.1

* **tag**: [v0.1](https://github.com/DistrictDataLabs/trinket/releases/tag/v0.1)
* **deployment**: Tuesday, October 13, 2015
* **commit**: [c863e42](https://github.com/DistrictDataLabs/trinket/commit/c863e421292be4eaeab36a9233f6ed7e0068679b)

MVP prototype of a dataset uploader and management application. This application framework will become the basis for the research project in the DDL Multidimensional Visualization Research Lab. For now, users can upload datasets, manage their descriptions, and preview the first 20 rows.

<!-- References -->
[travis_img]: https://travis-ci.org/DistrictDataLabs/trinket.svg?branch=master
[travis_href]: https://travis-ci.org/DistrictDataLabs/trinket
[coveralls_img]: https://coveralls.io/repos/DistrictDataLabs/trinket/badge.svg?branch=master&service=github
[coverals_href]: https://coveralls.io/github/DistrictDataLabs/trinket?branch=master
[waffle_img]: https://badge.waffle.io/DistrictDataLabs/trinket.png?label=ready&title=Ready
[waffle_href]: https://waffle.io/DistrictDataLabs/trinket
[rtfd_img]: https://readthedocs.org/projects/trinket/badge/?version=latest
[rtfd_href]: http://trinket.readthedocs.org/en/latest/?badge=latest
[wall.jpg]: https://flic.kr/p/75C2ac
17 changes: 16 additions & 1 deletion coffer/views.py
@@ -19,6 +19,7 @@

from django.db import IntegrityError
from braces.views import LoginRequiredMixin
from django.views.generic.list import ListView
from django.views.generic.edit import FormView
from django.views.generic.detail import DetailView

@@ -62,7 +63,21 @@ def get_context_data(self, **kwargs):
        return context


class DatasetListView(LoginRequiredMixin, ListView):

    model = Dataset
    template_name = "coffer/dataset_list.html"
    paginate_by = 25
    context_object_name = "dataset_list"

    def get_context_data(self, **kwargs):
        context = super(DatasetListView, self).get_context_data(**kwargs)
        context['num_datasets'] = Dataset.objects.count()
        context['latest_dataset'] = Dataset.objects.latest().created
        return context


class DatasetDetailView(LoginRequiredMixin, DetailView):

    template_name = "site/dataset.html"
    template_name = "coffer/dataset_detail.html"
    model = Dataset
65 changes: 65 additions & 0 deletions docs/about.md
@@ -0,0 +1,65 @@
# About Trinket

Trinket is a dataset management, analysis and visualization tool that is being built as part of the DDL Multidimensional Visualization Research Lab. See: [Parallel Coordinates](http://homes.cs.washington.edu/~jheer//files/zoo/ex/stats/parallel.html) for more on the types of visualizations we're experimenting with.

## Contributing

Trinket is open source, but because this is a District Data Labs project, we would appreciate it if you would let us know how you intend to use the software (other than simply copying and pasting code so that you can use it in your own projects). If you would like to contribute (especially if you are a student or research labs member at District Data Labs), you can do so in the following ways:

1. Add issues or bugs to the bug tracker: [https://github.com/DistrictDataLabs/trinket/issues](https://github.com/DistrictDataLabs/trinket/issues)
2. Work on a card on the dev board: [https://waffle.io/DistrictDataLabs/trinket](https://waffle.io/DistrictDataLabs/trinket)
3. Create a pull request in Github: [https://github.com/DistrictDataLabs/trinket/pulls](https://github.com/DistrictDataLabs/trinket/pulls)

Note that labels in the Github issues are defined in the blog post: [How we use labels on GitHub Issues at Mediocre Laboratories](https://mediocre.com/forum/topics/how-we-use-labels-on-github-issues-at-mediocre-laboratories).

If you are a member of the District Data Labs Faculty group, you have direct access to the repository, which is set up in a typical production/release/development cycle as described in _[A Successful Git Branching Model](http://nvie.com/posts/a-successful-git-branching-model/)_. A typical workflow is as follows:

1. Select a card from the [dev board](https://waffle.io/DistrictDataLabs/trinket), preferably one that is "ready", then move it to "in-progress".

2. Create a branch off of develop called "feature-[feature name]", work and commit into that branch.

~$ git checkout -b feature-myfeature develop

3. Once you are done working (and everything is tested) merge your feature into develop.

~$ git checkout develop
~$ git merge --no-ff feature-myfeature
~$ git branch -d feature-myfeature
~$ git push origin develop

4. Repeat. Releases will be routinely pushed into master via release branches, then deployed to the server.
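The release half of step 4 can be sketched as follows, per the branching model referenced above. The branch and tag names and the throwaway repository setup are illustrative only, not project commands:

```shell
# Demonstrated in a throwaway repository so the sketch is self-contained;
# in practice you would run only the release commands inside your clone.
repo="$(mktemp -d)" && cd "$repo"
git init -q && git checkout -q -B master
git config user.email "dev@example.com" && git config user.name "dev"
echo "trinket" > README.md && git add README.md && git commit -qm "initial"
git checkout -qb develop
echo "feature" >> README.md && git commit -qam "feature work"

# Cut a release branch off develop, merge it into master, tag the release,
# then merge back into develop and delete the release branch.
git checkout -qb release-0.2 develop
git checkout -q master
git merge -q --no-ff release-0.2 -m "Merge branch 'release-0.2'"
git tag -a v0.2 -m "version 0.2"
git checkout -q develop
git merge -q --no-ff release-0.2 -m "Merge release-0.2 back into develop"
git branch -d release-0.2
```

The `--no-ff` flag preserves a merge commit on master for each release, which is what makes the release history visible in the graph.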

## Contributors

Thank you for all your help contributing to make Trinket a great project!

### Maintainers

- Benjamin Bengfort: [@bbengfort](https://github.com/bbengfort/)
- Rebecca Bilbro: [@rebeccabilbro](https://github.com/rebeccabilbro)

### Contributors

- Tony Ojeda: [@ojedatony1616](https://github.com/ojedatony1616)

## Changelog

The release versions that are sent to the Python Package Index (PyPI) are also tagged in GitHub. You can see the tags through the GitHub web application and download the tarball of the version you'd like. Additionally, PyPI will host the various releases of Trinket (eventually).

The versioning uses a three-part scheme, "a.b.c": "a" represents a major release that may not be backwards compatible; "b" is incremented on minor releases that may contain extra features but are backwards compatible; "c" releases are bug fixes or other micro changes that developers should feel free to update to immediately.

### Version 0.2

* **tag**: [v0.2](https://github.com/DistrictDataLabs/trinket/releases/tag/v0.2)
* **deployment**: Wednesday, January 27, 2016
* **commit**: (see tag)

This minor update gave a bit more functionality to the MVP prototype, even though the version was intended to have a much more impactful feature set. However, after some study the workflow is changing, so this development branch is being pruned and deployed in preparation for the next batch of work. The major achievements of this version are the documentation that discusses our approach and the dataset search and listing page that is now available.

### Version 0.1

* **tag**: [v0.1](https://github.com/DistrictDataLabs/trinket/releases/tag/v0.1)
* **deployment**: Tuesday, October 13, 2015
* **commit**: [c863e42](https://github.com/DistrictDataLabs/trinket/commit/c863e421292be4eaeab36a9233f6ed7e0068679b)

MVP prototype of a dataset uploader and management application. This application framework will become the basis for the research project in the DDL Multidimensional Visualization Research Lab. For now, users can upload datasets, manage their descriptions, and preview the first 20 rows.
98 changes: 98 additions & 0 deletions docs/auto_analysis.md
@@ -0,0 +1,98 @@
# Automated Analysis

## Overview

Trinket is designed to mirror what experienced data scientists do when they take their first few passes through a new dataset: it intelligently automates large portions of the wrangling and analysis/exploration phases of the data science pipeline, integrating them into the initial ingestion (upload) phase.

## Architecture

The auto-analysis and text parsing features of Trinket are written in Python. They work by scanning the columns of uploaded data with `numpy` and `unicodecsv`, then applying one-dimensional kernel density estimates (KDEs), standard analysis of variance (ANOVA) mechanisms, and other hypothesis tests.

![Seed dataset](images/data_set.png)

This enables Trinket to perform type identification, i.e. to identify and differentiate discrete integers, floats, text data, normal distributions, classes, outliers, and errors. To perform this analysis quickly and accurately during the data ingestion process, Trinket includes a rules-based system trained from previously annotated datasets and coupled with heuristic rules determined in discussions with a range of experienced data scientists.
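The one-dimensional KDE pass mentioned above can be sketched in a few lines of `numpy`. This is a toy illustration of the technique, not Trinket's implementation, and the bandwidth default is an arbitrary assumption:

```python
import numpy as np

def kde_1d(samples, grid, bandwidth=0.5):
    """Evaluate a Gaussian kernel density estimate of `samples` on `grid`."""
    # one Gaussian kernel centered on each sample, evaluated at every grid point
    diffs = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * diffs ** 2) / np.sqrt(2.0 * np.pi)
    # average the kernels and rescale so the density integrates to one
    return kernels.sum(axis=1) / (len(samples) * bandwidth)
```

Peaks in the estimated density hint at classes or clusters within a column, while mass far from the bulk of the distribution flags candidate outliers.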

## Mechanics

Auto-analysis works by assigning each column/feature a data type (`dtype` in the parlance of NumPy and Pandas), e.g. categorical, numeric, real, integer, etc. These types must be automatically inferred from the dataset.

The auto-analysis method takes as input a file-like object and generic keyword arguments and returns as output a tuple/list whose length is the (maximum) number of columns in the dataset, and whose values contain the datatype of each column, ordered by column index.
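A minimal sketch of such a method, with the standard library's `csv` module standing in for `unicodecsv`; `detect_column_types` and `guess_type` are illustrative names, not Trinket's actual API:

```python
import csv

def guess_type(values):
    # try casts from narrowest to broadest; the first that fits every value wins
    for cast, name in ((int, 'int'), (float, 'float')):
        try:
            for v in values:
                cast(v)
            return name
        except ValueError:
            continue
    return 'str'

def detect_column_types(fobj, **kwargs):
    # returns a tuple whose length is the (maximum) number of columns,
    # holding the guessed type of each column, ordered by column index
    rows = list(csv.reader(fobj, **kwargs))
    ncols = max(len(row) for row in rows)
    columns = [[row[i] for row in rows if i < len(row)] for i in range(ncols)]
    return tuple(guess_type(col) for col in columns)
```

For example, `detect_column_types` over a file containing `1,2.5,a` and `3,4.5,b` would yield `('int', 'float', 'str')`.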


_Questions to answer:_

- How do other libraries like `pandas` and `messytables` do this?
Pandas computes [histograms](https://github.com/pydata/pandas/blob/master/pandas/core/algorithms.py#L250), looks for the [min](https://github.com/pydata/pandas/blob/master/pandas/core/algorithms.py#L537) and [max](https://github.com/pydata/pandas/blob/master/pandas/core/algorithms.py#L556) values of a column, samples [quantiles](https://github.com/pydata/pandas/blob/master/pandas/core/algorithms.py#L410), and counts [unique values](https://github.com/pydata/pandas/blob/master/pandas/core/algorithms.py#L55).

- Do you have to go through the whole dataset to make a decision?
Yes and no: decide based on how big the dataset is. The strategy below builds a sample from the first 50 non-empty rows for each column, as well as the rows with the longest and shortest values. For larger datasets, sampling 10% may be sufficient; for extremely large datasets, 1% might be enough.

- Can we use a sampling approach to reading the data?
Naive method (assumes straightforward densities):

```python
import numpy as np

def sample_column(col):
    lengths = [len(str(v)) for v in col]
    mx = col[max(range(len(col)), key=lengths.__getitem__)]  # row with the longest value
    mn = col[min(range(len(col)), key=lengths.__getitem__)]  # row with the shortest value
    non_empty = [v for v in col if v][:50]                   # first 50 non-empty values
    return np.array([mn, mx] + non_empty)
```

- Is there a certain density of data required to make a decision?
This is a good question - some libraries build histograms for each column to examine densities. See the [`pandas` method for histograms](https://github.com/pydata/pandas/blob/master/pandas/core/algorithms.py#L250).
TODO: look into thresholds

- What types are we looking for?
__string__, __datetime__, __float__, __integer__, __boolean__
See also [`messytables` types](https://github.com/okfn/messytables/blob/master/messytables/types.py).

Attempt parsing from broadest type to narrowest:

```python
import numpy as np

col_type = None
for val in col_sample:
    if val.dtype.type is np.string_:
        col_type = np.dtype('S%d' % max_len)  # max_len: length of the longest value in col
    elif val.dtype.type is np.datetime64:
        col_type = np.dtype('datetime64[s]')  # new & experimental in NumPy 1.7.0
    elif val.dtype.type is np.float_:
        col_type = np.dtype('float64')
    elif val.dtype.type is np.int_:
        col_type = np.dtype('int64')
    elif val.dtype.type is np.bool_:
        col_type = np.dtype('bool')
    else:
        pass  # TODO: handle unicode and complex types
```

- What does column-major mean for Trinket?
Use [`transpose`](http://docs.scipy.org/doc/numpy-1.10.1/reference/generated/numpy.ndarray.T.html) and/or [`reshape`](http://docs.scipy.org/doc/numpy-1.10.1/reference/generated/numpy.reshape.html) from `numpy`.

- Can we automatically detect delimiters and quote characters? (e.g. ; vs ,)
See `messytables` [method for delimiter detection](https://github.com/okfn/messytables/blob/master/messytables/commas.py).

- How do we detect if there is a header row or not?
See `messytables` [method for header detection](https://github.com/okfn/messytables/blob/7e4f12abef257a4d70a8020e0d024df6fbb02976/messytables/headers.py).

- How lightweight/heavyweight must this be?
Look into making more lightweight using regular expressions & hard-coded rules (see [Brill tagging](https://en.wikipedia.org/wiki/Brill_tagger)).
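For the delimiter and header questions above, the standard library's `csv.Sniffer` offers a lightweight baseline worth comparing against `messytables`:

```python
import csv

sample = "name;age;score\nalice;31;2.5\nbob;27;4.0\n"
dialect = csv.Sniffer().sniff(sample)          # guesses the delimiter from the text
has_header = csv.Sniffer().has_header(sample)  # heuristic: does row 1 differ in type?
```

`Sniffer.has_header` votes column by column on whether the first row's values look different from the rest (e.g. a string heading a numeric column), which is essentially the rules-based approach discussed here.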

## Sources

[Datatypes in Python - 2.7](https://docs.python.org/2/library/datatypes.html)

[Datatypes in Python - 3.5](https://docs.python.org/3.5/library/datatypes.html)

[Numpy - dtypes](http://docs.scipy.org/doc/numpy/reference/arrays.dtypes.html)

[UnicodeCSV](https://github.com/jdunck/python-unicodecsv/blob/master/README.rst)

[Pandas](http://pandas.pydata.org/)

[MessyTables](https://messytables.readthedocs.org/en/latest/)

[Dataproxy](https://github.com/okfn/dataproxy)

[Algorithms for Type Guessing - Stackoverflow](http://stackoverflow.com/questions/6824862/data-type-recognition-guessing-of-csv-data-in-python)

[Python Libraries for Type Guessing - Stackoverflow](http://stackoverflow.com/questions/3098337/method-for-guessing-type-of-data-represented-currently-represented-as-strings-in)
Binary file added docs/images/data_set.png
Binary file added docs/images/parallel_coords.png
Binary file added docs/images/pipeline.png
Binary file added docs/images/radviz.png
Binary file added docs/images/scatter_matrix.png
File renamed without changes
