This repository has been archived by the owner on Jan 6, 2025. It is now read-only.

[MRG + 1] Copyedit all documentation for Camelot #112

Merged
6 changes: 3 additions & 3 deletions docs/api.rst
@@ -5,11 +5,11 @@ API Reference

.. module:: camelot

Main Interface
Main interface
--------------
.. autofunction:: camelot.read_pdf

Lower-Level Classes
Lower-level classes
-------------------

.. autoclass:: camelot.handlers.PDFHandler
@@ -21,7 +21,7 @@ Lower-Level Classes
.. autoclass:: camelot.parsers.Lattice
:inherited-members:

Lower-Lower-Level Classes
Lower-lower-level classes
-------------------------

.. autoclass:: camelot.core.TableList
26 changes: 13 additions & 13 deletions docs/dev/contributing.rst
@@ -9,7 +9,7 @@ This document will help you get started with contributing documentation, code, t

.. _Vinayak Mehta: https://vinayak-mehta.github.io

Code Of Conduct
The Code Of Conduct
-------------------

The following quote sums up the **Code Of Conduct**.
@@ -24,7 +24,7 @@ As the `Requests Code Of Conduct`_ states, **all contributions are welcome**, as

.. _Requests Code Of Conduct: http://docs.python-requests.org/en/master/dev/contributing/#be-cordial

Your First Contribution
Your first contribution
-----------------------

A great way to start contributing to Camelot is to pick an issue tagged with the `Contributor Friendly`_ or the `Easy`_ tags. If you're unable to find a good first issue, feel free to contact the maintainer.
@@ -39,13 +39,13 @@ To install the dependencies needed for development, you can use pip::

$ pip install camelot-py[dev]

Pull Requests
Pull requests
-------------

Submit a Pull Request
Submit a pull request
^^^^^^^^^^^^^^^^^^^^^

The preferred workflow for contributing to Camelot is to fork the `project repository`_ on GitHub, clone, develop on a branch and then finally submit a pull request. Steps:
The preferred workflow for contributing to Camelot is to fork the `project repository`_ on GitHub, clone, develop on a branch and then finally submit a pull request. Here are the steps:

.. _project repository: https://github.com/socialcopsdev/camelot

@@ -76,7 +76,7 @@ Now it's time to go to your fork of Camelot and create a pull request! You c

.. _follow these instructions: https://help.github.com/articles/creating-a-pull-request-from-a-fork/

Work on your Pull Request
Work on your pull request
^^^^^^^^^^^^^^^^^^^^^^^^^

We recommend that your pull request complies with the following guidelines:
@@ -89,7 +89,7 @@ We recommend that your pull request complies with the following guidelines:

.. _numpydoc: https://numpydoc.readthedocs.io/en/latest/format.html

- Make sure your commit messages follow `the seven rules of a great git commit message`_.
- Make sure your commit messages follow `the seven rules of a great git commit message`_:
- Separate subject from body with a blank line
- Limit the subject line to 50 characters
- Capitalize the subject line
@@ -112,34 +112,34 @@ We recommend that your pull request complies with the following guidelines:

$ python setup.py test

Writing Documentation
Writing documentation
---------------------

Writing documentation, function docstrings, examples and tutorials is a great way to start contributing to open-source software! The documentation is present inside the ``docs/`` directory of the source code repository.

The documentation is written in `reStructuredText`_, with `Sphinx`_ used to generate these lovely HTML files that you're currently reading (unless you're reading this on GitHub). You can edit the documentation using any text editor and then generate the HTML output by running ``make html`` in the ``docs/`` directory.

The function docstrings are written using the `numpydoc`_ extension for Sphinx. Make sure you check out how its format guidelines, before you start writing one.
The function docstrings are written using the `numpydoc`_ extension for Sphinx. Make sure you check out its format guidelines before you start writing one.

.. _reStructuredText: https://en.wikipedia.org/wiki/ReStructuredText
.. _Sphinx: http://www.sphinx-doc.org/en/master/
.. _numpydoc: https://numpydoc.readthedocs.io/en/latest/format.html

Filing Issues
Filing issues
-------------

We use `GitHub issues`_ to keep track of all issues and pull requests. Before opening an issue (which asks a question or reports a bug), it is advisable to use GitHub search to look for existing issues (both open and closed) that may be similar.
We use `GitHub issues`_ to keep track of all issues and pull requests. Before opening an issue (which asks a question or reports a bug), please use GitHub search to look for existing issues (both open and closed) that may be similar.

.. _GitHub issues: https://docs.pytest.org/en/latest/

Questions
^^^^^^^^^

Please don't use GitHub issues for support questions, a better place for them would be `Stack Overflow`_. Make sure you tag them using the ``python-camelot`` tag.
Please don't use GitHub issues for support questions. A better place for them would be `Stack Overflow`_. Make sure you tag them using the ``python-camelot`` tag.

.. _Stack Overflow: http://stackoverflow.com

Bug Reports
Bug reports
^^^^^^^^^^^

In bug reports, make sure you include:
18 changes: 9 additions & 9 deletions docs/index.rst
@@ -23,11 +23,11 @@ Release v\ |version|. (:ref:`Installation <install>`)
.. image:: https://img.shields.io/pypi/pyversions/camelot-py.svg
:target: https://pypi.org/project/camelot-py/

**Camelot** is a Python library which makes it easy for *anyone* to extract tables from PDF files!
**Camelot** is a Python library that makes it easy for *anyone* to extract tables from PDF files!

----

**Here's how you can extract tables from PDF files.** Check out the PDF used in this example, `here`_.
**Here's how you can extract tables from PDF files.** Check out the PDF used in this example `here`_.

.. _here: _static/pdf/foo.pdf

@@ -55,15 +55,15 @@ Release v\ |version|. (:ref:`Installation <install>`)

There's a :ref:`command-line interface <cli>` too!

.. note:: Camelot only works with text-based PDFs and not scanned documents. If you can click-and-drag to select text in your table in a PDF viewer, then your PDF is text-based.
.. note:: Camelot only works with text-based PDFs and not scanned documents. If you can click and drag to select text in your table in a PDF viewer, then your PDF is text-based.

Why Camelot?
------------

- **You are in control**: Unlike other libraries and tools which either give a nice output or fail miserably (with no in-between), Camelot gives you the power to tweak table extraction. (Since everything in the real world, including PDF table extraction, is fuzzy.)
- **Metrics**: *Bad* tables can be discarded based on metrics like accuracy and whitespace, without ever having to manually look at each table.
- Each table is a **pandas DataFrame**, which enables seamless integration into `ETL and data analysis workflows`_.
- **Export** to multiple formats, including json, excel and html.
- **You are in control.** Unlike other libraries and tools which either give a nice output or fail miserably (with no in-between), Camelot gives you the power to tweak table extraction. (This is important since everything in the real world, including PDF table extraction, is fuzzy.)
- *Bad* tables can be discarded based on **metrics** like accuracy and whitespace, without ever having to manually look at each table.
- Each table is a **pandas DataFrame**, which seamlessly integrates into `ETL and data analysis workflows`_.
- **Export** to multiple formats, including JSON, Excel and HTML.

See `comparison with other PDF table extraction libraries and tools`_.

@@ -73,7 +73,7 @@ See `comparison with other PDF table extraction libraries and tools`_.
The User Guide
--------------

This part of the documentation, begins with some background information about why Camelot was created, takes a small dip into the implementation details and then focuses on step-by-step instructions for getting the most out of Camelot.
This part of the documentation begins with some background information about why Camelot was created, takes a small dip into the implementation details and then focuses on step-by-step instructions for getting the most out of Camelot.

.. toctree::
:maxdepth: 2
@@ -85,7 +85,7 @@ This part of the documentation, begins with some background information about wh
user/advanced
user/cli

The API Documentation / Guide
The API Documentation/Guide
-----------------------------

If you are looking for information on a specific function, class, or method,
26 changes: 13 additions & 13 deletions docs/user/advanced.rst
@@ -8,7 +8,7 @@ This page covers some of the more advanced configurations for :ref:`Lattice <lat
Process background lines
------------------------

To detect line segments, :ref:`Lattice <lattice>` needs the lines that make the table, to be in foreground. Here's an example of a table with lines in background.
To detect line segments, :ref:`Lattice <lattice>` needs the lines that make the table to be in the foreground. Here's an example of a table with lines in the background:

.. figure:: ../_static/png/background_lines.png
:scale: 50%
@@ -68,16 +68,16 @@ Let's plot all the text present on the table's PDF page.
:alt: A plot of all text on a PDF page
:align: left

This, as we shall later see, is very helpful with :ref:`Stream <stream>`, for noting table areas and column separators, in case Stream does not guess them correctly.
This, as we shall later see, is very helpful with :ref:`Stream <stream>` for noting table areas and column separators, in case Stream does not guess them correctly.

.. note:: The *x-y* coordinates shown aboe change as you move your mouse cursor on the image, which can help you note coordinates.
.. note:: The *x-y* coordinates shown above change as you move your mouse cursor on the image, which can help you note coordinates.

.. _geometry_table:

table
^^^^^

Let's plot the table (to see if it was detected correctly or not). This geometry type, along with contour, line and joint is useful for debugging and improving the extraction output, in case the table wasn't detected correctly. More on that later.
Let's plot the table (to see if it was detected correctly or not). This geometry type, along with contour, line and joint, is useful for debugging and improving the extraction output, in case the table wasn't detected correctly. (More on that later.)

::

@@ -170,9 +170,9 @@ In cases like `these <../_static/pdf/column_separators.pdf>`__, where the text i

You can pass the column separators as a list of comma-separated strings to :meth:`read_pdf() <camelot.read_pdf>`, using the ``columns`` keyword argument.

In case you passed a single column separators string list, and no table area is specified, the separators will be applied to the whole page. When a list of table areas is specified and there is a need to specify column separators as well, **the length of both lists should be equal**. Each table area will be mapped to each column separators' string using their indices.
If you pass a list with a single column separators string and no table area is specified, the separators will be applied to the whole page. When a list of table areas is specified and you need to specify column separators as well, **the length of both lists should be equal**. Each table area will be mapped to the column separators string at the same index.

For example, if you have specified two table areas, ``table_areas=['12,23,43,54', '20,33,55,67']``, and only want to specify column separators for the first table, you can pass an empty string for the second table in the column separators' list, like this, ``columns=['10,120,200,400', '']``.
For example, if you have specified two table areas, ``table_areas=['12,23,43,54', '20,33,55,67']``, and only want to specify column separators for the first table, you can pass an empty string for the second table in the column separators list, like this: ``columns=['10,120,200,400', '']``.
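
The index-based pairing described above can be sketched in plain Python. This is only an illustration under stated assumptions: ``pair_areas_with_columns`` is a hypothetical helper, not part of Camelot's API. It mirrors the rule that both lists must have the same length and that an empty string means no separators for that table.

```python
def pair_areas_with_columns(table_areas, columns):
    """Pair each table area string with its column separators by index.

    Hypothetical helper for illustration only; not Camelot's code.
    """
    if len(table_areas) != len(columns):
        raise ValueError("table_areas and columns must have the same length")
    pairs = []
    for area, cols in zip(table_areas, columns):
        # Each area is a comma-separated string of four coordinates.
        area_coords = tuple(float(v) for v in area.split(","))
        # An empty string means no separators were given for this table.
        separators = [float(v) for v in cols.split(",")] if cols else []
        pairs.append((area_coords, separators))
    return pairs


pairs = pair_areas_with_columns(
    ["12,23,43,54", "20,33,55,67"],
    ["10,120,200,400", ""],
)
```

Here the first table area gets four separators and the second gets an empty list, so only the first table would use explicit column boundaries.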

Let's get back to the *x* coordinates we got from :ref:`plotting text <geometry_text>` that exists on this `PDF <../_static/pdf/column_separators.pdf>`__, and get the table out!

@@ -188,12 +188,12 @@ Let's get back to the *x* coordinates we got from :ref:`plotting text <geometry_
"NUMBER TYPE DBA NAME","","","LICENSEE NAME","ADDRESS","CITY","ST","ZIP","PHONE NUMBER","EXPIRES"
"...","...","...","...","...","...","...","...","...","..."

Ah! Since `PDFMiner <https://euske.github.io/pdfminer/>`_ merged the strings, "NUMBER", "TYPE" and "DBA NAME"; all of them were assigned to the same cell. Let's see how we can fix this in the next section.
Ah! Since `PDFMiner <https://euske.github.io/pdfminer/>`_ merged the strings "NUMBER", "TYPE" and "DBA NAME", all of them were assigned to the same cell. Let's see how we can fix this in the next section.

Split text along separators
---------------------------

To deal with cases like the output from the previous section, you can pass ``split_text=True`` to :meth:`read_pdf() <camelot.read_pdf>`, which will split any strings that lie in different cells but have been assigned to the a single cell (as a result of being merged together by `PDFMiner <https://euske.github.io/pdfminer/>`_).
To deal with cases like the output from the previous section, you can pass ``split_text=True`` to :meth:`read_pdf() <camelot.read_pdf>`, which will split any strings that lie in different cells but have been assigned to a single cell (as a result of being merged together by `PDFMiner <https://euske.github.io/pdfminer/>`_).
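
To make the idea concrete, here is a minimal sketch of splitting a merged string along column separators. It assumes that each character comes with an *x* coordinate; ``split_along_separators`` is a hypothetical helper and does not reflect how Camelot implements ``split_text`` internally.

```python
from bisect import bisect_right


def split_along_separators(chars, separators):
    """Distribute characters into cells based on their x coordinates.

    `chars` is a list of (character, x) pairs and `separators` is a
    sorted list of x coordinates. Hypothetical helper for illustration.
    """
    cells = [""] * (len(separators) + 1)
    for ch, x in chars:
        # The number of separators to the left of x is the cell index.
        cells[bisect_right(separators, x)] += ch
    return [cell.strip() for cell in cells]


# "NUMBER TYPE" laid out left to right, 10 units per character.
chars = [(c, 10 + 10 * i) for i, c in enumerate("NUMBER TYPE")]
cells = split_along_separators(chars, [75])
```

With a separator between the two words, the merged string ends up in two cells instead of one.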

::

@@ -210,13 +210,13 @@ To deal with cases like the output from the previous section, you can pass ``spl
Flag superscripts and subscripts
--------------------------------

There might be cases where you want to differentiate between the text, and superscripts or subscripts, like this `PDF <../_static/pdf/superscript.pdf>`_.
There might be cases where you want to differentiate between the text and superscripts or subscripts, like this `PDF <../_static/pdf/superscript.pdf>`_.

.. figure:: ../_static/png/superscript.png
:alt: A PDF with superscripts
:align: left

In this case, the text that `other tools`_ return, will be ``24.912``. This is harmless as long as there is that decimal point involved. But when it isn't there, you'll be left wondering why the results of your data analysis were 10x bigger!
In this case, the text that `other tools`_ return will be ``24.912``. This is relatively harmless when that decimal point is involved. But when it isn't there, you'll be left wondering why the results of your data analysis are 10x bigger!

You can solve this by passing ``flag_size=True``, which will enclose the superscripts and subscripts with ``<s></s>``, based on font size, as shown below.
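
The size-based flagging can be sketched as follows. This is an illustration under assumptions, not Camelot's implementation: ``flag_small_text`` is a hypothetical helper that treats the most common font size as regular text and wraps anything smaller in ``<s></s>``.

```python
from collections import Counter


def flag_small_text(chars):
    """Wrap runs of smaller-than-usual characters in <s></s>.

    `chars` is a list of (character, font_size) pairs. Hypothetical
    helper for illustration only; not Camelot's implementation.
    """
    if not chars:
        return ""
    # Treat the most common font size as the size of regular text.
    main_size = Counter(size for _, size in chars).most_common(1)[0][0]
    out = []
    in_flag = False
    for ch, size in chars:
        small = size < main_size
        if small and not in_flag:
            out.append("<s>")
        elif not small and in_flag:
            out.append("</s>")
        in_flag = small
        out.append(ch)
    if in_flag:
        out.append("</s>")
    return "".join(out)


# "24.91" in 10 pt text followed by a superscript "2" in 6 pt text.
chars = [(c, 10.0) for c in "24.91"] + [("2", 6.0)]
flagged = flag_small_text(chars)
```

The superscript is now marked in the string, so it can be stripped or handled separately before any numeric conversion.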

@@ -327,7 +327,7 @@ Voila! Camelot can now see those lines. Let's get our table.
Shift text in spanning cells
----------------------------

By default, the :ref:`Lattice <lattice>` method shifts text in spanning cells, first to the left and then to the top, as you can observe in the output table above. However, this behavior can be changed using the ``shift_text`` keyword argument. Think of it as setting the *gravity* for a table, it decides the direction in which the text will move and finally come to rest.
By default, the :ref:`Lattice <lattice>` method shifts text in spanning cells, first to the left and then to the top, as you can observe in the output table above. However, this behavior can be changed using the ``shift_text`` keyword argument. Think of it as setting the *gravity* for a table: it decides the direction in which the text will move and finally come to rest.

``shift_text`` expects a list with one or more characters from the following set: ``('', 'l', 'r', 't', 'b')``, which are then applied *in order*. The default, as we discussed above, is ``['l', 't']``.
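
The *gravity* idea can be modeled with a small sketch. It assumes a spanning cell is described by its inclusive row and column bounds; ``shift_position`` is a hypothetical helper, not Camelot's code, and only shows how directions applied in order decide where the text comes to rest.

```python
def shift_position(row, col, span, shift_text=("l", "t")):
    """Return the cell a piece of text ends up in within a spanning cell.

    `span` is (top_row, left_col, bottom_row, right_col), inclusive.
    Hypothetical helper for illustration only; not Camelot's code.
    """
    top, left, bottom, right = span
    for direction in shift_text:
        if direction == "l":
            col = left
        elif direction == "r":
            col = right
        elif direction == "t":
            row = top
        elif direction == "b":
            row = bottom
        # An empty string leaves the text where it is.
    return row, col


span = (0, 0, 2, 3)  # a cell spanning rows 0-2 and columns 0-3
default_cell = shift_position(1, 2, span)  # left, then top
in_place = shift_position(1, 2, span, shift_text=[""])  # no shift
right_bottom = shift_position(1, 2, span, shift_text=["r", "b"])
```

In this toy model, the default moves the text to the top-left corner of the span, an empty string keeps it in place, and ``['r', 'b']`` moves it to the bottom-right corner.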

@@ -356,7 +356,7 @@ We'll use the `PDF <../_static/pdf/short_lines.pdf>`__ from the previous example
"Knowledge &Practices on HTN &","2400","Men (≥ 18 yrs)","-","-","-","1728"
"DM","2400","Women (≥ 18 yrs)","-","-","-","1728"

No surprises there, it did remain in place (observe the strings "2400" and "All the available individuals"). Let's pass ``shift_text=['r', 'b']``, to set the *gravity* to right-bottom, and move the text in that direction.
No surprises there: it did remain in place (observe the strings "2400" and "All the available individuals"). Let's pass ``shift_text=['r', 'b']`` to set the *gravity* to right-bottom and move the text in that direction.

::

@@ -380,7 +380,7 @@ No surprises there, it did remain in place (observe the strings "2400" and "All
Copy text in spanning cells
---------------------------

You can copy text in spanning cells when using :ref:`Lattice <lattice>`, in either horizontal or vertical direction, or both. This behavior is disabled by default.
You can copy text in spanning cells when using :ref:`Lattice <lattice>`, in either the horizontal or vertical direction, or both. This behavior is disabled by default.

``copy_text`` expects a list with one or more characters from the following set: ``('v', 'h')``, which are then applied *in order*.
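
As a rough sketch of what copying in order means, the example below fills empty cells from their left or upper neighbor. It makes a simplifying assumption that every empty string belongs to a spanning cell, which real tables do not guarantee. ``copy_spanning_text`` is a hypothetical helper, not Camelot's implementation.

```python
def copy_spanning_text(rows, copy_text):
    """Fill empty cells from their neighbours, one direction at a time.

    Simplifying assumption: every empty string marks a cell that belongs
    to a spanning cell. Hypothetical helper, not Camelot's implementation.
    """
    filled = [list(row) for row in rows]
    for direction in copy_text:
        for r, row in enumerate(filled):
            for c, value in enumerate(row):
                if value:
                    continue
                if direction == "h" and c > 0:
                    # Copy from the cell to the left.
                    row[c] = row[c - 1]
                elif direction == "v" and r > 0:
                    # Copy from the cell above.
                    row[c] = filled[r - 1][c]
    return filled


rows = [
    ["Region", "2017", ""],
    ["", "Q1", "Q2"],
]
vertical = copy_spanning_text(rows, ["v"])
both = copy_spanning_text(rows, ["h", "v"])
```

With ``['v']`` only the first column is filled downward, while ``['h', 'v']`` first copies "2017" to the right and then "Region" down.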

4 changes: 2 additions & 2 deletions docs/user/cli.rst
@@ -1,11 +1,11 @@
.. _cli:

Command-line interface
Command-Line Interface
======================

Camelot comes with a command-line interface.

You can print the help for the interface, by typing ``camelot --help`` in your favorite terminal program, as shown below. Furthermore, you can print the help for each command, by typing ``camelot <command> --help``, try it out!
You can print the help for the interface by typing ``camelot --help`` in your favorite terminal program, as shown below. Furthermore, you can print the help for each command by typing ``camelot <command> --help``. Try it out!

::
