# About Trinket

Trinket is a dataset management, analysis, and visualization tool being built as part of the DDL Multidimensional Visualization Research Lab. See [Parallel Coordinates](http://homes.cs.washington.edu/~jheer//files/zoo/ex/stats/parallel.html) for more on the types of visualizations we're experimenting with.

## Contributing

Trinket is open source, but because this is a District Data Labs project, we would appreciate it if you would let us know how you intend to use the software (other than simply copying and pasting code into your own projects). If you would like to contribute (especially if you are a student or research lab member at District Data Labs), you can do so in the following ways:

1. Add issues or bugs to the bug tracker: [https://github.com/DistrictDataLabs/trinket/issues](https://github.com/DistrictDataLabs/trinket/issues)
2. Work on a card on the dev board: [https://waffle.io/DistrictDataLabs/trinket](https://waffle.io/DistrictDataLabs/trinket)
3. Create a pull request on GitHub: [https://github.com/DistrictDataLabs/trinket/pulls](https://github.com/DistrictDataLabs/trinket/pulls)

Note that the labels used in the GitHub issues are defined in the blog post [How we use labels on GitHub Issues at Mediocre Laboratories](https://mediocre.com/forum/topics/how-we-use-labels-on-github-issues-at-mediocre-laboratories).

If you are a member of the District Data Labs Faculty group, you have direct access to the repository, which is set up in a typical production/release/development cycle as described in _[A Successful Git Branching Model](http://nvie.com/posts/a-successful-git-branching-model/)_. A typical workflow is as follows:

1. Select a card from the [dev board](https://waffle.io/DistrictDataLabs/trinket) - preferably one that is "ready" - then move it to "in-progress".

2. Create a branch off of develop called "feature-[feature name]", then work and commit into that branch.

        ~$ git checkout -b feature-myfeature develop

3. Once you are done working (and everything is tested), merge your feature back into develop.

        ~$ git checkout develop
        ~$ git merge --no-ff feature-myfeature
        ~$ git branch -d feature-myfeature
        ~$ git push origin develop

4. Repeat. Releases will be routinely pushed into master via release branches, then deployed to the server; a sketch of the release commands follows.
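
    Following the branching model linked above, cutting a release might look like this (the branch name and version number below are placeholders, not an actual release):

        ~$ git checkout -b release-0.3 develop   # "0.3" is a placeholder version
        ~$ git checkout master
        ~$ git merge --no-ff release-0.3
        ~$ git tag -a v0.3
        ~$ git push origin master --tags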

## Contributors

Thank you for all your help contributing to make Trinket a great project!

### Maintainers

- Benjamin Bengfort: [@bbengfort](https://github.com/bbengfort/)
- Rebecca Bilbro: [@rebeccabilbro](https://github.com/rebeccabilbro)

### Contributors

- Tony Ojeda: [@ojedatony1616](https://github.com/ojedatony1616)

## Changelog

The release versions that are sent to the Python Package Index (PyPI) are also tagged in GitHub. You can see the tags through the GitHub web application and download the tarball of the version you'd like. Additionally, PyPI will host the various releases of Trinket (eventually).

The versioning uses a three-part system, "a.b.c": "a" represents a major release that may not be backwards compatible; "b" is incremented on minor releases that may add extra features but remain backwards compatible; and "c" releases are bug fixes or other micro changes that developers should feel free to update to immediately. For example, moving from 1.2.0 to 1.2.1 should always be safe, while moving from 1.2.0 to 2.0.0 may break existing code.

### Version 0.2

* **tag**: [v0.2](https://github.com/DistrictDataLabs/trinket/releases/tag/v0.2)
* **deployment**: Wednesday, January 27, 2016
* **commit**: (see tag)

This minor update added a bit more functionality to the MVP prototype, even though the version was originally intended to have a much more impactful feature set. After some study, however, the workflow is changing, so this development branch is being pruned and deployed in preparation for the next batch of work. The major achievements of this version are the documentation that discusses our approach and the dataset search and listing page that is now available.

### Version 0.1

* **tag**: [v0.1](https://github.com/DistrictDataLabs/trinket/releases/tag/v0.1)
* **deployment**: Tuesday, October 13, 2015
* **commit**: [c863e42](https://github.com/DistrictDataLabs/trinket/commit/c863e421292be4eaeab36a9233f6ed7e0068679b)

MVP prototype of a dataset uploader and management application. This application framework will become the basis for the research project in the DDL Multidimensional Visualization Research Lab. For now, users can upload datasets, manage their descriptions, and preview the first 20 rows.

# Automated Analysis

## Overview

Trinket is designed to mirror what experienced data scientists do when they take their first few passes through a new dataset. It intelligently automates large portions of the wrangling and analysis/exploration phases of the data science pipeline, integrating them into the initial ingestion and upload phase.

## Architecture

The auto-analysis and text parsing features of Trinket are written in Python. They work by scanning the columns of uploaded data with `numpy` and `unicodecsv`, then characterizing each column using one-dimensional kernel density estimates (KDEs), standard analysis of variance mechanisms (ANOVAs), and hypothesis testing.
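
For example, a one-dimensional KDE over a numeric column can be computed with `scipy` (a minimal sketch of the technique, not Trinket's actual code):

```python
import numpy as np
from scipy import stats

# Estimate the density of a numeric column with a 1D Gaussian KDE in
# order to characterize the shape of its distribution.
column = np.array([1.2, 1.9, 2.1, 2.4, 3.8, 4.0, 4.1])
kde = stats.gaussian_kde(column)
grid = np.linspace(column.min(), column.max(), 100)
density = kde(grid)  # estimated density at each point on the grid
```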

![Seed dataset](images/data_set.png)

This enables Trinket to perform type identification, e.g. to identify and differentiate discrete integers, floats, text data, normal distributions, classes, outliers, and errors. To perform this analysis quickly and accurately during the data ingestion process, Trinket includes a rules-based system trained on previously annotated datasets and coupled with heuristic rules determined in discussions with a range of experienced data scientists.

## Mechanics

Auto-analysis works by assigning each column/feature a data type (`dtype` in the parlance of NumPy and Pandas), e.g. categorical, numeric, real, integer, etc. These types must be automatically inferred from the dataset.

The auto-analysis method takes as input a file-like object and generic keyword arguments, and returns as output a tuple whose length is the (maximum) number of columns in the dataset and whose values are the data types of each column, ordered by column index.
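
A minimal sketch of such a method, assuming CSV input read with `unicodecsv` (the `analyze` and `infer_column_dtype` names are hypothetical, not Trinket's actual API):

```python
import unicodecsv as csv

def analyze(fobj, **kwargs):
    # Hypothetical sketch: read a file-like object of CSV data and return
    # a tuple of inferred column types, ordered by column index.
    rows = list(csv.reader(fobj, **kwargs))
    ncols = max(len(row) for row in rows)
    # Pivot row-major records into columns, tolerating ragged rows.
    columns = [[row[i] for row in rows if i < len(row)] for i in range(ncols)]
    # infer_column_dtype is the per-column routine sketched further below.
    return tuple(infer_column_dtype(col) for col in columns)
```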

_Questions to answer:_

- How do other libraries like `pandas` and `messytables` do this?
  Pandas computes [histograms](https://github.com/pydata/pandas/blob/master/pandas/core/algorithms.py#L250), looks for the [min](https://github.com/pydata/pandas/blob/master/pandas/core/algorithms.py#L537) and [max](https://github.com/pydata/pandas/blob/master/pandas/core/algorithms.py#L556) values of a column, samples [quantiles](https://github.com/pydata/pandas/blob/master/pandas/core/algorithms.py#L410), and counts [unique values](https://github.com/pydata/pandas/blob/master/pandas/core/algorithms.py#L55); the same primitives are illustrated below.
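
For illustration, these primitives are all available through the public `pandas` API (our example, not Trinket's code):

```python
import pandas as pd

# Coerce a string column to numbers; unparseable values become NaN.
s = pd.to_numeric(pd.Series(["1.5", "2", "", "3.25"]), errors="coerce")
print(s.min(), s.max())               # extrema of the parsed column
print(s.quantile([0.25, 0.5, 0.75]))  # sampled quantiles
print(s.value_counts(dropna=True))    # counts of unique values
```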

- Do you have to go through the whole dataset to make a decision?
  Yes and no - decide based on how big the dataset is. The strategy sketched below builds a sample from the first 50 non-empty rows of each column, plus the rows with the longest and shortest values. For larger datasets, sampling 10% may be enough; for extremely large datasets, 1% might suffice.

- Can we use a sampling approach to reading the data?
  A naive method (which assumes straightforward densities) is sketched below:

```python
import numpy as np

def sample_column(col, k=50):
    # Build the sample for one column: the first k non-empty values,
    # plus the shortest and longest values in the column.
    values = [v for v in col if v not in (None, "")]
    shortest = min(values, key=len)
    longest = max(values, key=len)
    return np.array(values[:k] + [shortest, longest])
```

- Is there a certain density of data required to make a decision?
  This is a good question - some libraries build histograms for each column to examine densities. See the [`pandas` method for histograms](https://github.com/pydata/pandas/blob/master/pandas/core/algorithms.py#L250).
  TODO: look into thresholds.

- What types are we looking for?
  __string__, __datetime__, __float__, __integer__, __boolean__
  See also the [`messytables` types](https://github.com/okfn/messytables/blob/master/messytables/types.py).

  Attempt parsing from the broadest type to the narrowest:

```python
import numpy as np

def infer_column_dtype(col_sample):
    # col_sample is an ndarray drawn from a single column; dispatch on
    # the element type, from broadest to narrowest, and cast the sample.
    if col_sample.dtype.type is np.string_:
        # Fixed-width string, sized to the longest value in the column.
        width = max(len(v) for v in col_sample)
        return col_sample.astype("S%d" % width)
    elif col_sample.dtype.type is np.datetime64:
        # datetime64 is new & experimental in NumPy 1.7.0.
        return col_sample.astype("datetime64[s]")
    elif col_sample.dtype.type is np.float_:
        return col_sample.astype("float64")
    elif col_sample.dtype.type is np.int_:
        return col_sample.astype("int64")
    elif col_sample.dtype.type is np.bool_:
        return col_sample.astype("bool")
    else:
        # Unicode and complex types still need handling here.
        raise TypeError("unhandled dtype: %r" % col_sample.dtype)
```

- What does column-major mean for Trinket?
  Use [`transpose`](http://docs.scipy.org/doc/numpy-1.10.1/reference/generated/numpy.ndarray.T.html) and/or [`reshape`](http://docs.scipy.org/doc/numpy-1.10.1/reference/generated/numpy.reshape.html) from `numpy`, for example:
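
```python
import numpy as np

records = np.array([[1, 2, 3], [4, 5, 6]])  # row-major: one record per row
features = records.T                        # transposed: one feature per row
assert features.shape == (3, 2)
```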

- Can we automatically detect delimiters and quote characters (e.g. `;` vs. `,`)?
  See the `messytables` [method for delimiter detection](https://github.com/okfn/messytables/blob/master/messytables/commas.py). As a baseline, the standard library's `csv.Sniffer` can detect both:
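
```python
import csv

sample = "a;b;c\n1;2;3\n"
dialect = csv.Sniffer().sniff(sample, delimiters=";,")
print(dialect.delimiter)   # ';'
print(dialect.quotechar)   # '"'
```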

- How do we detect if there is a header row or not?
  See the `messytables` [method for header detection](https://github.com/okfn/messytables/blob/7e4f12abef257a4d70a8020e0d024df6fbb02976/messytables/headers.py). `csv.Sniffer` also offers a simple heuristic for this:
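
```python
import csv

sample = "name,age\nalice,31\nbob,27\n"
# has_header guesses by comparing the first row's types and lengths
# against the rest of the sample.
print(csv.Sniffer().has_header(sample))  # True
```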

- How lightweight/heavyweight must this be?
  Look into making this more lightweight using regular expressions and hard-coded rules (see [Brill tagging](https://en.wikipedia.org/wiki/Brill_tagger)); a sketch follows this list.
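
A minimal rule chain in that spirit (our sketch, not Trinket's code) orders the patterns from narrowest to broadest and returns the first match:

```python
import re

# Order matters: narrower patterns are tried first, and "string" is
# the fallback when nothing else matches.
RULES = [
    ("boolean",  re.compile(r"^(true|false)$", re.IGNORECASE)),
    ("integer",  re.compile(r"^[+-]?\d+$")),
    ("float",    re.compile(r"^[+-]?(\d+\.\d*|\.\d+)$")),
    ("datetime", re.compile(r"^\d{4}-\d{2}-\d{2}")),
]

def guess_type(value):
    for name, pattern in RULES:
        if pattern.match(value.strip()):
            return name
    return "string"

assert guess_type(" 42 ") == "integer"
assert guess_type("2016-01-27") == "datetime"
```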

## Sources

- [Datatypes in Python - 2.7](https://docs.python.org/2/library/datatypes.html)
- [Datatypes in Python - 3.5](https://docs.python.org/3.5/library/datatypes.html)
- [NumPy - dtypes](http://docs.scipy.org/doc/numpy/reference/arrays.dtypes.html)
- [UnicodeCSV](https://github.com/jdunck/python-unicodecsv/blob/master/README.rst)
- [Pandas](http://pandas.pydata.org/)
- [MessyTables](https://messytables.readthedocs.org/en/latest/)
- [Dataproxy](https://github.com/okfn/dataproxy)
- [Algorithms for Type Guessing - Stack Overflow](http://stackoverflow.com/questions/6824862/data-type-recognition-guessing-of-csv-data-in-python)
- [Python Libraries for Type Guessing - Stack Overflow](http://stackoverflow.com/questions/3098337/method-for-guessing-type-of-data-represented-currently-represented-as-strings-in)