Associations

Associations is a Python 3 library used to identify high-dimensional statistical relationships in any data set. This library is useful as a first-pass data analysis tool to understand:

A high-level overview of every potential relationship in a data set.
Arbitrary dimensionality, identifying the relationships between every combination of elements.

This library assigns a relative "association" score for every n*m dimensional relationship of elements in the data set, and optionally outputs some graphs to illustrate those scores. You may use other libraries to apply more advanced or formalized statistical models to understand the finer details of those relationships.

This library has not had any changes since 2017, so it is due for an upgrade. Please open any issues or pull requests to make improvements. In the future, I may make significant breaking changes or fully replace much of the library functionality.

Installation

The latest release can be found here. For a direct download of the development version (latest revision, not latest release), click here.

Dependencies: NumPy, matplotlib, multiprocessing (included in the Python standard library)
Build Dependencies: git (only for Arch Linux python-associations-git package)

This is not compatible with Python 2.

Universal

Run this if you would like to install associations directly into Python without the use of a package manager. This should be compatible with any system.

$ python setup.py sdist
# python setup.py install

Arch Linux

If you would like to install the latest development version (latest revision), you should install the python-associations-git package. All you have to do is download the PKGBUILD and the ABS will automatically download the source and install the package. You can keep reusing the same PKGBUILD. It will automatically update the version number based on the revision.

To install python-associations-git, run this in an empty directory:

$ wget https://raw.githubusercontent.com/dnut/PKGBUILDs/master/assocations/python-associations-git/PKGBUILD
$ makepkg -si

If you would like to install the latest release, you should install the python-associations package. You can download it here and install using the included PKGBUILD. To update this package, you will need to download the release from that page.

To install python-associations, cd into the python-associations directory and run this command:

$ makepkg -si

You can also download the source code for the latest revision manually and install using the included PKGBUILD. I only recommend this if you are either contributing to development or forking your own local version of the package.

Overview

The library counts occurrences with a histogram, finds associations between different fields, and provides tools that aid in analyzing the resulting data.

libassoc.py

This file contains the most generic procedures that do not belong in any created classes. They are convenient procedures for Python's fundamental data structures.

histogram.py - Histogram()

The primary job of a Histogram() object is to traverse a CSV file and create a NumPy array with one dimension for each field we wish to record, then fill that array with the count of every possible combination of field values. This is accomplished with the count() method. Access to the internal data structure is provided via the get() method.
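The counting idea described above can be sketched in a few lines. This is only an illustration of the technique, not the library's implementation: the function name `count_csv` and the sample data are hypothetical, and the returned `valists` and `valdicts` merely mirror the attributes of the same name listed below.

```python
import csv
import io

import numpy as np


def count_csv(csv_file, fields):
    """Count every combination of values for `fields` in a CSV file object.

    Hypothetical helper for illustration; not part of the library.
    """
    rows = [tuple(row[f] for f in fields) for row in csv.DictReader(csv_file)]

    # One sorted list of distinct values per field (compare `valists`).
    valists = [sorted({row[i] for row in rows}) for i in range(len(fields))]
    # Inverted lookup, value string -> axis index (compare `valdicts`).
    valdicts = [{v: i for i, v in enumerate(vals)} for vals in valists]

    # One axis per field, one slot per distinct value of that field.
    hist = np.zeros([len(vals) for vals in valists], dtype=int)
    for row in rows:
        index = tuple(d[v] for d, v in zip(valdicts, row))
        hist[index] += 1
    return hist, valists, valdicts


# Invented sample data standing in for a CSV file on disk.
sample = io.StringIO(
    "sex,day,diagnosis\n"
    "male,Tuesday,fracture\n"
    "male,Tuesday,fracture\n"
    "female,Monday,burn\n"
    "male,Monday,burn\n"
)
hist, valists, valdicts = count_csv(sample, ["sex", "day"])
male, tuesday = valdicts[0]["male"], valdicts[1]["Tuesday"]
print(hist.shape)           # (2, 2)
print(hist[male, tuesday])  # 2
```

Each field becomes one axis of the array, so recording more fields adds dimensions rather than columns.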

Attribute	Description
`fields`	Table fields that we want to measure.
`histogram`	NumPy array containing counts.
`valists`	List of lists containing strings of each field's values.
`valdicts`	List of dicts, inverted valists (key = string, val = int).
`valists_dict`	Dict of valists keyed by field names.
`valdicts_dict`	Dict of valdicts keyed by field names.
`field_index`	Keys field values to field names.
`field_index_int`	Keys field values to valists/valdicts index (int).
`nonzero_indices`	Indices for all nonzero values in the histogram.

Method	Description
`count()`	Count all occurrences for every possible situation.
`useful_stuff()`	Expose the string values for quantitative internal data structure.
`reduce()`	Return new `Histogram()` with provided numpy array. Used by `simplify()` and `slice()`.
`simplify()`	Return new `Histogram()` with fewer dimensions by summing undesired dimensions. For example, create a histogram that drops the sex dimension. All remaining fields have combined value for both male and female.
`slice()`	Return new `Histogram()` with fewer dimensions by isolating a specific situation. For example, create a histogram representing only males with no data for females.
`nonzeros()`	Generator function that iterates through every nonzero element, optionally providing string representations.
`get()`	Retrieve count for any field value combination.
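The difference between `simplify()` and `slice()` can be illustrated directly on a NumPy array. This is a rough sketch with invented counts and plain NumPy operations. It assumes that summing over an axis and indexing into an axis capture the two behaviours described in the table, and it does not call the library's own methods.

```python
import numpy as np

# Hypothetical 2 x 3 histogram: axis 0 = sex, axis 1 = day.
sexes = ["female", "male"]
days = ["Monday", "Tuesday", "Wednesday"]
counts = np.array([[4, 1, 3],
                   [2, 5, 6]])

# "Simplify": drop the sex dimension by summing over it.
# Every remaining cell holds the combined count for both sexes.
by_day = counts.sum(axis=0)

# "Slice": keep only males by indexing into the sex dimension.
# Counts for females are discarded rather than merged.
males_by_day = counts[sexes.index("male")]

print(by_day.tolist())        # [6, 6, 9]
print(males_by_day.tolist())  # [2, 5, 6]
```

Both results have one dimension fewer than the original array, but only the summed version still accounts for every record.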

associations.py

Contains two classes that serve to identify associations in a Histogram(). Associator() finds associations for a specific field combination and Associations() uses Associator() objects to find all associations.

Associator() is a distinct class rather than integrating its methods into Associations() because Associations() uses multiprocessing to dramatically improve execution time on multi-core systems, and it needs relatively isolated objects to be passed to subprocesses. This implementation is intended to be superior to the redundancy of many Associations() objects or the complexity of queues and pipes without hurting code legibility or efficiency.

Associator()

The associator object identifies associations between different field values (e.g. fatalities and amputations) by comparing one group to a larger group that encompasses it.

Knowing that white males are injured on Tuesday more frequently than black males is not very useful information because it is likely caused by there being more white males than black. Furthermore, knowing that white males are injured more often on Tuesday than on other days doesn't tell us whether or not white males and Tuesday are associated, because it may be that Tuesdays have more injuries overall. Therefore, we must establish a standardized numerical value that represents the actual association between two fields by taking into consideration the overall populations we are sampling from.

As another example, if we want to find the association between amputations and fatalities (diagnosis and disposition), we need to take the same approach. While the likelihood that an amputation is fatal is valuable information, we are more interested in the relative fatality of different diagnoses. Amputations may have a very low likelihood of fatality, but we must compare it to the likelihood that any other diagnosis leads to fatality before we discover whether amputations are relatively likely to be fatal. Therefore, we must take into consideration the extreme infrequency of fatalities in general to get a standardized numerical representation of how associated each field is.

There are two approaches to resolve our dilemma that are mathematically equivalent. One approach is to divide the number of fatal amputations by the number of amputations with any disposition, which yields the likelihood that an amputation is fatal. But we want to normalize this likelihood by scaling it according to the likelihood that anything may be fatal. To do so, we divide it by the overall likelihood of fatality (total fatalities / total of everything), and that yields the association ratio between amputation and fatality.

Identical results would be reached by first dividing fatal amputations by all fatalities (likelihood that a fatality is caused by amputation) and then dividing that by the average likelihood that an amputation is the cause of any disposition (total amputations / total of everything). This results in the exact same association ratio as the first approach.

Both approaches are the same algorithm run in opposite directions. They are also mathematically equivalent since they both result in the same calculation:

association between amputations and fatalities = ((fatal amputations) * (total of everything)) / ((fatalities) * (amputations))
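As a quick check of the equivalence, the following sketch computes the ratio both ways and with the general formula. All counts are invented for illustration, and exact fractions are used so the comparison is not affected by floating-point rounding.

```python
from fractions import Fraction

# Hypothetical counts, invented purely for illustration.
fatal_amputations = 3
amputations = 200
fatalities = 50
total = 10_000

# Approach 1: likelihood that an amputation is fatal,
# scaled by the likelihood that anything is fatal.
approach_1 = Fraction(fatal_amputations, amputations) / Fraction(fatalities, total)

# Approach 2: likelihood that a fatality is an amputation,
# scaled by the likelihood that anything is an amputation.
approach_2 = Fraction(fatal_amputations, fatalities) / Fraction(amputations, total)

# General formula: (fatal amputations * total) / (fatalities * amputations).
general = Fraction(fatal_amputations * total, fatalities * amputations)

assert approach_1 == approach_2 == general
print(float(general))  # 3.0
```

With these made-up numbers, an amputation is three times as likely to be fatal as an average case. A ratio of 1 would mean the two values occur together exactly as often as their overall frequencies predict.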

Originally, for efficiency, I used a specialized version of the aforementioned algorithm (calculate likelihoods then divide) in order to naturally cache totals and subtotals for multiple situations. Unfortunately, this led to a very complex and confusing algorithm.

To keep the algorithm simple, I have written a new one optimized to use the general formula as efficiently as possible. I have actually gotten it to be more efficient than the original algorithm. This algorithm is significantly less complex. It is more maintainable and easier to understand and use, so it is favored.

I still see some potential to optimize a few places in the algorithm to improve efficiency even further, but this would require a lot of benchmarking and will probably not be a huge improvement, so it is not my top priority.

Attribute	Description
`notable`	Minimum association ratio (or inverse) to be included.
`significant`	Minimum number of occurrences (statistical significance).
`assoc`	Associations organized by association then subgroup.
`subpops`	Associations organized by subgroup then association.
`hist`	`Histogram()` object to extract data from.

Method	Description
`add()`	Save association ratio.
`find()`	Find the association ratio for every field value combination among a specific field name combination.
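One way to read the `notable` and `significant` thresholds is as a filter applied before a ratio is saved. The sketch below is an assumption about how such a filter could work, not the library's code: the function name `is_reportable` and the default values are hypothetical.

```python
def is_reportable(ratio, count, notable=2.0, significant=10):
    """Return True if an association is strong enough and frequent enough.

    Hypothetical helper; the default thresholds are invented for illustration.
    """
    # Too few occurrences to be meaningful, whatever the ratio says.
    if count < significant:
        return False
    # Keep strong positive associations (ratio >= notable) and
    # strong negative ones (ratio <= 1 / notable, the "inverse").
    return ratio >= notable or ratio <= 1 / notable


print(is_reportable(3.0, 25))  # True
print(is_reportable(0.4, 25))  # True
print(is_reportable(1.2, 25))  # False
print(is_reportable(3.0, 4))   # False
```

Checking the inverse as well means that two values that rarely occur together are reported alongside those that frequently do.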

Associations()

Attributes: self.pairs and self.subpops contain all association ratios.

>>> self.pairs
{
	pair_type: {
		frozenset(association_pair): {
			frozenset(subgroup/subpopulation): association_ratio
		}
	}
}

>>> self.subpops
{
	subgroup_type: {
		frozenset(subgroup/subpopulation): {
			frozenset(association_pair): association_ratio
		}
	}
}
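The two layouts hold the same ratios under swapped inner keys. The sketch below builds both from a single record so the symmetry is visible. It assumes that the type-level keys are frozensets of field names, and the `add` function shown here is a hypothetical stand-in, not the library's method. The ratio is an invented value.

```python
from collections import defaultdict

# Three levels of nesting, matching the layouts shown above.
pairs = defaultdict(lambda: defaultdict(dict))
subpops = defaultdict(lambda: defaultdict(dict))


def add(pair_type, pair, subgroup_type, subgroup, ratio):
    """Store one ratio in both structures (hypothetical helper)."""
    pair, subgroup = frozenset(pair), frozenset(subgroup)
    pairs[frozenset(pair_type)][pair][subgroup] = ratio
    subpops[frozenset(subgroup_type)][subgroup][pair] = ratio


# Invented example: amputation/fatality association among males.
add(("diagnosis", "disposition"), ("amputation", "fatality"),
    ("sex",), ("male",), 1.8)

# frozenset keys are order-independent, so either ordering finds the entry.
by_pair = pairs[frozenset({"disposition", "diagnosis"})]
print(by_pair[frozenset({"fatality", "amputation"})][frozenset({"male"})])  # 1.8

by_subgroup = subpops[frozenset({"sex"})]
print(by_subgroup[frozenset({"male"})][frozenset({"amputation", "fatality"})])  # 1.8
```

Using frozensets as keys means a pair can be looked up without caring which field value was listed first.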

Method	Description
`find_all()`	Use multiprocessing pool to test every field name combination using `Associator().find()`.
`helper()`	Runs `Associator().find()`. Needed for multiprocessing.
`add()`	Add entire `Associator()`'s data structures to `Associations()` object using `merge()`.
`merge()`	Lower level dictionary processor than `add()`.
`report()`	Report associations between two fields.
`subgroup_report()`	Report associations for any pairs within a subgroup/subpopulation.

analysis.py

Contains two classes, Analysis() and AsciiTable().

Analysis()

Analyze data from Histogram() and Associations().

Attribute	Description
`hist`	`Histogram()`
`assoc`	`Associations()`
`gen_assoc`	Average association ratios for combo types.
`maxes` and `mins`	Max and min association ratios for combo types.

Method	Description
`make_hist()`	Create data structure for a histogram plot.
`prep_hist()`	Used by `make_hist()` to include only notable data.
`plot_hist()`	Use data from `make_hist()` to create an actual plot.
`plot_assoc()`	Use `make_hist()` and `plot_hist()` for specific purpose of plotting association ratios between two field names.
`nice_plot_assoc()`	Try `plot_assoc()` with various `notable` values to create a legible plot containing meaningful data.
`plot_all()`	Run `nice_plot_assoc()` for every field combination.
`max_helper()`	Find `mins` and `maxes` while making hists.
`most_common()`	Most common occurrences.
`most_assoc()`	Most associated occurrences.
`extremes()`	Most associated occurrences (broader).

AsciiTable()

Attribute	Description
`tables`	List of table strings.

Method	Description
`table()`	Draw ascii table.
`table_section()`	Format data into a section to be interpreted by `table()`.
