Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat/Decouple Partitioning User API and Implementation #1521

Closed
newelh opened this issue Sep 25, 2023 · 3 comments
Closed

feat/Decouple Partitioning User API and Implementation #1521

newelh opened this issue Sep 25, 2023 · 3 comments
Labels
enhancement New feature or request

Comments

@newelh
Copy link
Contributor

newelh commented Sep 25, 2023

Description

The user API is tightly bound to the internal implementation of the partitioners. As a result, changes in keyword arguments or function behavior immediately impact the user API. Users have direct access to the implementation functions, as indicated in the current documentation. This tight coupling restricts our flexibility to refactor or enhance the partitioners without introducing breaking changes.

The proposed changes are intended to increase the speed of development and eventually provide the ability to introduce new user facing APIs in the future.

Proposed Solution

Introduce an abstraction layer between the user API and the partitioner implementations. This layer would expose only the necessary functionalities and delegate to the appropriate partitioners. This decoupling allows for more flexibility in internal changes without affecting the user API. A phased approach would be needed, where this issue would be the first step of the following:

  1. Move the contents of the partition function for each partitioner into a private function _partition_{type}. For example, the email partitioner. While this seems trivial, it will enable us to further develop the partitioners without obstructing APIs that users may have in production. It will enable us to simplify and consolidate functionality in the partitioners so that we can eventually support a simpler interface for users in the future.
@process_metadata()
@add_metadata_with_filetype(FileType.EML)
@add_chunking_strategy()
def partition_email(
    filename: Optional[str] = None,
    file: Optional[Union[IO[bytes], SpooledTemporaryFile]] = None,
    text: Optional[str] = None,
    content_source: str = "text/html",
    encoding: Optional[str] = None,
    include_headers: bool = False,
    max_partition: Optional[int] = 1500,
    include_metadata: bool = True,
    metadata_filename: Optional[str] = None,
    metadata_last_modified: Optional[str] = None,
    process_attachments: bool = False,
    attachment_partitioner: Optional[Callable] = None,
    min_partition: Optional[int] = 0,
    chunking_strategy: Optional[str] = None,
    **kwargs,
) -> List[Element]:
    """ Wrap the _partition_email function to separate user facing API from internal API"""
    return _partition_email(
        filename=filename,
        file=file,
        text=text,
        content_source=content_source,
        encoding=encoding,
        include_headers=include_headers,
        max_partition=max_partition,
        include_metadata=include_metadata,
        metadata_filename=metadata_filename,
        metadata_last_modified=metadata_last_modified,
        process_attachments=process_attachments,
        attachment_partitioner=attachment_partitioner,
        min_partition=min_partition,
        chunking_strategy=chunking_strategy,
        **kwargs,
    )

Follow-up phases/issues after completing phase 1:
2. Extract file reading into it's own function that passes a stream to the private partition functions
3. Move metadata processing from a decorator to a function call before/after partitioning. This allows us to support preprocessing steps for handling user input (e.g. language) and post-processing steps for consolidating processed element data (e.g. hierarchy, chunking) in a decoupled interface that can be better unit tested.
4. Move chunking from a decorator to a function call after partitioning
5. Write a new user facing api for accessing the simpler, decoupled components

All of this can be done incrementally, per partitioner. The end goal is to reduce the complexity of the keyword arguments in the user-facing API and reduce the coupling of downstream components for further feature development. In particular for document pre/post processing: metadata such as language and hierarchy, chunking, and document level metadata such as encoding, content source.

Alternatives

Do Nothing: Leave it as it is and continue with the tightly coupled system, accepting the maintenance and extensibility costs.

Additional Context

Unstructured already has an abstraction layer to some degree with the auto partitioner, the documentation does advocate though for using document-specific partitioners for increased speed, fewer dependencies and additional features.

@newelh newelh added the enhancement New feature or request label Sep 25, 2023
@scanny
Copy link
Collaborator

scanny commented Sep 25, 2023

A couple very nice concrete benefits of this that I see:

  1. Ease unit-testing. Decorators complicate unit testing because the "inner" function cannot be isolated from the decorator function. So the only alternative is to test the entire "glued-together" composition. Having a facade API function allows the core partitioner to be tested in isolation from the decorated API function.
  2. Allow flexible composition of implementation from distinct units. An API function naturally tends to become "fat" with keyword options as functionality is added. This is good for the user (if designed carefully) providing them a lot of flexibility. However, not all options are necessarily used by all parts of the implemention. We can see this already where some kwargs are used by decorators only. A distinct API function allows the implementation to be flexibly composed and recomposed from a growing set of distinct "step" or "aspect" implementor functions, each taking only the arguments they require. This improves our ability to respond to newly added function without struggling with a rigid implementation.
  3. Provide code-tree shape flexibility. As partitioner function modules begin to approach 1000 lines, it becomes time to think about transitioning each into a subpackage (directory). The distinct and smallish API function can be placed in the init.py module and the supporting implementation distributed to submodules in that directory.

I think other benefits are going to occur to me, but that's a start :)

@newelh
Copy link
Contributor Author

newelh commented Sep 25, 2023

Related: #1520

@newelh
Copy link
Contributor Author

newelh commented Sep 26, 2023

Related PR #1527

qued pushed a commit that referenced this issue Nov 1, 2023
### Summary
Closes #1520 
Partial solution to #1521 

- Adds an abstraction layer between the user API and the partitioner
implementation
- Adds comments explaining paragraph chunking
- Makes edits to pass strict type-checking for both text.py and
test_text.py
@MthwRobinson MthwRobinson closed this as not planned Won't fix, can't repro, duplicate, stale Aug 26, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

3 participants