A tool for exploring and analyzing text data
`text_data` is a lightweight, performant, and framework-independent tool for analyzing text data in Python. Requiring only `numpy` and `IPython`, it offers a suite of features that make it easy to search through documents, compute statistics on those documents, and visualize your findings.
Its key features are:
- An inverted positional index allowing you to efficiently conduct boolean searches across a set of documents, including for arbitrarily long phrases.
- A set of statistical calculations for analyzing text data, most of which work either as single point estimates or as vectorized calculations across an entire corpus.
- A flexible design that makes it easy to split up your corpus so you can explore what makes different portions of it distinct.
- An optional dependency that allows you to display data visualizations of your findings.
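To make the first feature concrete, here is a minimal sketch of how an inverted positional index supports phrase queries. This is illustrative Python, not `text_data`'s actual API: every term maps to the positions where it appears in each document, and a phrase matches wherever consecutive terms occupy consecutive positions.

```python
from collections import defaultdict

def build_positional_index(documents):
    """Map each token to {doc_id: [positions]} so phrase queries can check adjacency."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, doc in enumerate(documents):
        for pos, token in enumerate(doc.lower().split()):
            index[token][doc_id].append(pos)
    return index

def phrase_search(index, phrase):
    """Return the ids of documents in which the tokens of `phrase` appear consecutively."""
    tokens = phrase.lower().split()
    if not tokens or tokens[0] not in index:
        return set()
    # Every token of the phrase must appear in a candidate document.
    candidates = set(index[tokens[0]])
    for token in tokens[1:]:
        candidates &= set(index.get(token, {}))
    matches = set()
    for doc_id in candidates:
        for start in index[tokens[0]][doc_id]:
            # The i-th phrase token must sit exactly i positions after the first.
            if all(start + i in index[tokens[i]][doc_id] for i in range(1, len(tokens))):
                matches.add(doc_id)
                break
    return matches

docs = ["the state of the union is strong", "the union of states"]
index = build_positional_index(docs)
print(phrase_search(index, "state of the union"))  # {0}
```

Because the index stores positions rather than just document ids, a phrase query of any length reduces to a few adjacency checks instead of a rescan of the raw text.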
The visualizations and analyses you can conduct range from the simple to the complex. At the simple end, you can compute the counts of individual words, mimicking the behavior of `collections.Counter`. But you can also find the TF-IDF scores of every word-document pair in the corpus, or manipulate `numpy` arrays to your heart's content. On the visual front, you can do anything from building simple bar charts of word counts to building heatmaps showing the outputs of a machine learning model. This, for instance, shows the cosine distances between the `doc2vec` vectors of different State of the Union addresses, using the Kaggle State of the Union Corpus:
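As a rough illustration of the kind of statistic involved, TF-IDF scores can be computed from a term-count matrix with a few lines of `numpy`. This sketch uses the smoothed IDF convention popularized by scikit-learn; it is an assumption for illustration, not the library's own implementation.

```python
import numpy as np

# Rows are documents, columns are terms; entries are raw term counts.
counts = np.array([
    [3, 0, 1],
    [2, 0, 0],
    [3, 0, 0],
    [4, 0, 0],
    [3, 2, 0],
    [3, 0, 2],
], dtype=float)

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)                    # number of documents containing each term
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1        # smoothed IDF (the scikit-learn convention)
tfidf = counts * idf                                   # broadcast IDF across every document row
tfidf /= np.linalg.norm(tfidf, axis=1, keepdims=True)  # L2-normalize each document
print(tfidf.round(3))
```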
There are two ways to install `text_data`. The minimal version of the tool can be installed with a simple `pip install`:
pip install text_data
If you want the features for visualizing your findings, you can optionally install them using:
pip install text_data[display]
or
poetry add text_data -E display
My idea for this library stemmed from a story I wrote on the way politicians message themselves on different platforms.
While I was working on that story, I noticed that I was writing a few functions that I felt could be generalized for other text-based analyses. Specifically, I spent a lot of my time writing code to display documents that matched certain phrases and to compute statistics comparing one set of documents to another. In addition, the code that I was writing was painfully inefficient.
This library aims to address all of these problems. Searching is as easy as collecting a list of raw text, choosing a tokenizer, and creating a corpus. Thanks to parallelized Rust code and an efficient indexing structure, search results and vectorized statistical calculations return rapidly.
In addition, the code relies only on `numpy` and `IPython`, and its design aims to make it simple to use regardless of what other frameworks or libraries your analysis uses.
The biggest drawback of this library is that it is not memory-efficient. All of the matrix calculations currently return dense `numpy` arrays even though the underlying data is mostly sparse, so they pose more of a memory burden than they need to. In addition, the library does not have any support for storing data in SQL. As a result, it's not well suited for tasks involving large quantities of text data. (Eventually, I hope to change that.)
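To put the memory point in perspective, here is a rough, standalone comparison (not code from this library) of what a mostly-zero count matrix costs as a dense `numpy` array versus as a `scipy.sparse` CSR matrix:

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)
# A term-document count matrix where the vast majority of entries are zero,
# which is typical of word counts across a corpus.
dense = rng.poisson(0.01, size=(1_000, 10_000)).astype(np.float64)
csr = sparse.csr_matrix(dense)

print(f"dense:  {dense.nbytes / 1e6:.1f} MB")
print(f"sparse: {(csr.data.nbytes + csr.indices.nbytes + csr.indptr.nbytes) / 1e6:.1f} MB")
```

Because the CSR format stores only the nonzero values and their coordinates, it needs a small fraction of the dense array's memory, which is why sparse matrix support is on the roadmap below.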
In addition, there are some frameworks whose authors have thought long and hard about how to help people explore text. (I've only used it a couple of times in a very exploratory manner, but `fastai` comes to mind.) If you're already using one of those libraries, you might be best off sticking with what you have.
I have a number of goals for `text_data`. First of all, I would love to see its set of supported statistical computations continually expand. If you have an idea of something that should be supported, please file an issue, or fork the library and file a pull request.
More concretely, I want to add the following functionality:
- Sparse matrix support in `scipy`. This will help address some of the library's current problems with memory efficiency.
- Support for SQL.
- Support for word co-occurrence matrix calculations and for boolean `WITHIN` searches (a conceptual sketch of co-occurrence counting follows below).
And I'm very open to other ideas.
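For the co-occurrence item above, the sketch below shows what a word co-occurrence matrix computes: how often each pair of words appears within a fixed window of one another. The function name and its `window` parameter are hypothetical illustrations, not a planned API.

```python
import numpy as np

def cooccurrence_matrix(documents, window=2):
    """Count how often each pair of words appears within `window` tokens of each other."""
    vocab = sorted({token for doc in documents for token in doc})
    positions = {token: i for i, token in enumerate(vocab)}
    matrix = np.zeros((len(vocab), len(vocab)), dtype=np.int64)
    for doc in documents:
        for i, token in enumerate(doc):
            lo, hi = max(0, i - window), min(len(doc), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    matrix[positions[token], positions[doc[j]]] += 1
    return vocab, matrix

docs = [["the", "state", "of", "the", "union"], ["the", "union", "of", "states"]]
vocab, matrix = cooccurrence_matrix(docs)
print(vocab)
print(matrix)
```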
There are two related code bases for this project: the Python code base and the Rust code base. Assuming you have Python 3.7+ and `poetry` installed, you should be able to install the Python dependencies with
poetry install
In order to compile the Rust code, you will need to have Rust and Cargo installed. Installation instructions are available on Rust's website. Testing also relies on `clippy`, `rustfmt`, and `cargo tarpaulin`.
If you only want to mess around with the Python code and do not want to touch the Rust code, you can run
poetry run make lint-py
poetry run pytest
to lint and test the Python code. Running
poetry run make lint
poetry run make test
will run the full test suite.
Once you've changed the code, written tests, and added documentation to this project, file a pull request. That will trigger a GitHub action. If the tests and linting pass, I will review the code and decide whether to incorporate it into the project.
- This package was created with Cookiecutter and the audreyr/cookiecutter-pypackage project template.
- The structure of the Rust code is heavily modified from, but inspired by, this inverted index crate in Rust.
- Some of the statistical functions and indexing functions from this crate were inspired by `scikit-learn`, `pandas`, or `nltk` (in particular, the design of the TF-IDF matrix function, which is modeled after `scikit-learn`'s `TfidfTransformer`).
- The visualization functions were modified from examples on `altair`'s website.
- I relied on the design of the positional index from Introduction to Information Retrieval by Manning, Raghavan, and Schütze.
- "Fightin' Words: Lexical Feature Selection and Evaluation for Identifying the Content of Political Conflict" by Monroe, Colaresi, and Quinn formed the inspiration for some of the statistical computations and for the word frequency graphic.