feat/Decouple Partitioning User API and Implementation #1521

newelh · 2023-09-25T15:33:21Z

Description

The user API is tightly bound to the internal implementation of the partitioners. As a result, changes in keyword arguments or function behavior immediately impact the user API. Users have direct access to the implementation functions, as indicated in the current documentation. This tight coupling restricts our flexibility to refactor or enhance the partitioners without introducing breaking changes.

The proposed changes are intended to increase the speed of development and eventually provide the ability to introduce new user facing APIs in the future.

Proposed Solution

Introduce an abstraction layer between the user API and the partitioner implementations. This layer would expose only the necessary functionalities and delegate to the appropriate partitioners. This decoupling allows for more flexibility in internal changes without affecting the user API. A phased approach would be needed, where this issue would be the first step of the following:

Move the contents of the partition function for each partitioner into a private function _partition_{type}. For example, the email partitioner. While this seems trivial, it will enable us to further develop the partitioners without obstructing APIs that users may have in production. It will enable us to simplify and consolidate functionality in the partitioners so that we can eventually support a simpler interface for users in the future.

@process_metadata()
@add_metadata_with_filetype(FileType.EML)
@add_chunking_strategy()
def partition_email(
    filename: Optional[str] = None,
    file: Optional[Union[IO[bytes], SpooledTemporaryFile]] = None,
    text: Optional[str] = None,
    content_source: str = "text/html",
    encoding: Optional[str] = None,
    include_headers: bool = False,
    max_partition: Optional[int] = 1500,
    include_metadata: bool = True,
    metadata_filename: Optional[str] = None,
    metadata_last_modified: Optional[str] = None,
    process_attachments: bool = False,
    attachment_partitioner: Optional[Callable] = None,
    min_partition: Optional[int] = 0,
    chunking_strategy: Optional[str] = None,
    **kwargs,
) -> List[Element]:
    """ Wrap the _partition_email function to separate user facing API from internal API"""
    return _partition_email(
        filename=filename,
        file=file,
        text=text,
        content_source=content_source,
        encoding=encoding,
        include_headers=include_headers,
        max_partition=max_partition,
        include_metadata=include_metadata,
        metadata_filename=metadata_filename,
        metadata_last_modified=metadata_last_modified,
        process_attachments=process_attachments,
        attachment_partitioner=attachment_partitioner,
        min_partition=min_partition,
        chunking_strategy=chunking_strategy,
        **kwargs,
    )

Follow-up phases/issues after completing phase 1:
2. Extract file reading into it's own function that passes a stream to the private partition functions
3. Move metadata processing from a decorator to a function call before/after partitioning. This allows us to support preprocessing steps for handling user input (e.g. language) and post-processing steps for consolidating processed element data (e.g. hierarchy, chunking) in a decoupled interface that can be better unit tested.
4. Move chunking from a decorator to a function call after partitioning
5. Write a new user facing api for accessing the simpler, decoupled components

All of this can be done incrementally, per partitioner. The end goal is to reduce the complexity of the keyword arguments in the user-facing API and reduce the coupling of downstream components for further feature development. In particular for document pre/post processing: metadata such as language and hierarchy, chunking, and document level metadata such as encoding, content source.

Alternatives

Do Nothing: Leave it as it is and continue with the tightly coupled system, accepting the maintenance and extensibility costs.

Additional Context

Unstructured already has an abstraction layer to some degree with the auto partitioner, the documentation does advocate though for using document-specific partitioners for increased speed, fewer dependencies and additional features.

The text was updated successfully, but these errors were encountered:

scanny · 2023-09-25T16:39:48Z

A couple very nice concrete benefits of this that I see:

Ease unit-testing. Decorators complicate unit testing because the "inner" function cannot be isolated from the decorator function. So the only alternative is to test the entire "glued-together" composition. Having a facade API function allows the core partitioner to be tested in isolation from the decorated API function.
Allow flexible composition of implementation from distinct units. An API function naturally tends to become "fat" with keyword options as functionality is added. This is good for the user (if designed carefully) providing them a lot of flexibility. However, not all options are necessarily used by all parts of the implemention. We can see this already where some kwargs are used by decorators only. A distinct API function allows the implementation to be flexibly composed and recomposed from a growing set of distinct "step" or "aspect" implementor functions, each taking only the arguments they require. This improves our ability to respond to newly added function without struggling with a rigid implementation.
Provide code-tree shape flexibility. As partitioner function modules begin to approach 1000 lines, it becomes time to think about transitioning each into a subpackage (directory). The distinct and smallish API function can be placed in the init.py module and the supporting implementation distributed to submodules in that directory.

I think other benefits are going to occur to me, but that's a start :)

newelh · 2023-09-25T17:57:40Z

Related: #1520

newelh · 2023-09-26T17:04:49Z

Related PR #1527

### Summary Closes #1520 Partial solution to #1521 - Adds an abstraction layer between the user API and the partitioner implementation - Adds comments explaining paragraph chunking - Makes edits to pass strict type-checking for both text.py and test_text.py

newelh added the enhancement New feature or request label Sep 25, 2023

newelh mentioned this issue Oct 11, 2023

enhancement proposal: refactor elif statements for partitioners in auto.py #1716

Closed

Coniferish mentioned this issue Oct 25, 2023

refactor text.py #1872

Merged

MthwRobinson added the needs follow up label May 15, 2024

MthwRobinson closed this as not planned Won't fix, can't repro, duplicate, stale Aug 26, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat/Decouple Partitioning User API and Implementation #1521

feat/Decouple Partitioning User API and Implementation #1521

newelh commented Sep 25, 2023 •

edited

Loading

scanny commented Sep 25, 2023 •

edited

Loading

newelh commented Sep 25, 2023

newelh commented Sep 26, 2023

feat/Decouple Partitioning User API and Implementation #1521

feat/Decouple Partitioning User API and Implementation #1521

Comments

newelh commented Sep 25, 2023 • edited Loading

Description

Proposed Solution

Alternatives

Additional Context

scanny commented Sep 25, 2023 • edited Loading

newelh commented Sep 25, 2023

newelh commented Sep 26, 2023

newelh commented Sep 25, 2023 •

edited

Loading

scanny commented Sep 25, 2023 •

edited

Loading