feat/Decouple Partitioning User API and Implementation #1521
Labels
enhancement
New feature or request
Comments
A couple very nice concrete benefits of this that I see:
I think other benefits are going to occur to me, but that's a start :)
Related: #1520
Related PR #1527
Description
The user API is tightly bound to the internal implementation of the partitioners. As a result, changes in keyword arguments or function behavior immediately impact the user API. Users have direct access to the implementation functions, as indicated in the current documentation. This tight coupling restricts our flexibility to refactor or enhance the partitioners without introducing breaking changes.
The proposed changes are intended to speed up development and eventually make it possible to introduce new user-facing APIs.
Proposed Solution
Introduce an abstraction layer between the user API and the partitioner implementations. This layer would expose only the necessary functionality and delegate to the appropriate partitioners. This decoupling allows more flexibility for internal changes without affecting the user API. A phased approach would be needed, with this issue as the first step:
1. Have each public partition function delegate to a private `_partition_{type}` function, for example for the email partitioner. While this seems trivial, it will enable us to further develop the partitioners without disrupting APIs that users may have in production. It will also enable us to simplify and consolidate functionality in the partitioners so that we can eventually support a simpler interface for users.

Follow-up phases/issues after completing phase 1:
2. Extract file reading into its own function that passes a stream to the private partition functions
3. Move metadata processing from a decorator to a function call before/after partitioning. This allows us to support preprocessing steps for handling user input (e.g. language) and post-processing steps for consolidating processed element data (e.g. hierarchy, chunking) in a decoupled interface that can be better unit tested.
4. Move chunking from a decorator to a function call after partitioning
5. Write a new user-facing API for accessing the simpler, decoupled components
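
As a rough illustration of phases 1 and 2, the sketch below shows a public function that only delegates to a private `_partition_{type}` function, with file reading pulled into its own helper. Everything here is an assumption made for illustration: `_open_stream`, the signatures, and the line-splitting logic are hypothetical and do not reflect the actual unstructured code.

```python
import io
from typing import IO, List, Optional


def _open_stream(filename: Optional[str] = None,
                 file: Optional[IO[bytes]] = None) -> IO[bytes]:
    """Hypothetical phase 2 helper: normalize either input form to a byte stream."""
    if (filename is None) == (file is None):
        raise ValueError("Provide exactly one of `filename` or `file`.")
    if file is not None:
        return file
    with open(filename, "rb") as f:
        return io.BytesIO(f.read())


def _partition_email(stream: IO[bytes]) -> List[str]:
    """Private implementation (placeholder logic): one element per non-empty line.

    Its signature can change freely because users never call it directly.
    """
    text = stream.read().decode("utf-8", errors="replace")
    return [line.strip() for line in text.splitlines() if line.strip()]


def partition_email(filename: Optional[str] = None,
                    file: Optional[IO[bytes]] = None) -> List[str]:
    """Public wrapper: a stable signature that delegates to the private function."""
    return _partition_email(_open_stream(filename=filename, file=file))
```

With this shape, refactoring `_partition_email` or `_open_stream` does not touch the signature that users depend on.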
All of this can be done incrementally, per partitioner. The end goal is to reduce the complexity of the keyword arguments in the user-facing API and to reduce the coupling of downstream components for further feature development. This applies in particular to document pre/post-processing: element metadata such as language and hierarchy, chunking, and document-level metadata such as encoding and content source.
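
To make the pre/post-processing idea more concrete, here is a minimal sketch of metadata and chunking applied as plain function calls around a private partition function instead of as decorators. All names (`Element`, `_partition_text`, `apply_metadata`, `apply_chunking`, `partition_pipeline`) and the greedy merging rule are hypothetical stand-ins, not the library's API.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Element:
    """Hypothetical stand-in for a partitioned document element."""
    text: str
    metadata: Dict[str, str] = field(default_factory=dict)


def _partition_text(text: str) -> List[Element]:
    """Private implementation (placeholder): one element per paragraph."""
    return [Element(p.strip()) for p in text.split("\n\n") if p.strip()]


def apply_metadata(elements: List[Element], language: str) -> List[Element]:
    """Metadata step as a plain function, so it can be unit tested in isolation."""
    for element in elements:
        element.metadata["language"] = language
    return elements


def apply_chunking(elements: List[Element], max_characters: int) -> List[Element]:
    """Chunking step: greedily merge neighbors while the chunk fits the limit."""
    chunks: List[Element] = []
    for element in elements:
        if chunks and len(chunks[-1].text) + 1 + len(element.text) <= max_characters:
            merged = chunks[-1].text + " " + element.text
            chunks[-1] = Element(merged, dict(chunks[-1].metadata))
        else:
            chunks.append(Element(element.text, dict(element.metadata)))
    return chunks


def partition_pipeline(text: str, language: str = "eng",
                       max_characters: Optional[int] = None) -> List[Element]:
    """Explicit sequence: partition, then metadata, then optional chunking."""
    elements = _partition_text(text)
    elements = apply_metadata(elements, language)
    if max_characters is not None:
        elements = apply_chunking(elements, max_characters)
    return elements
```

Because each step is an ordinary function, the order of operations is visible in one place and each step can be tested without invoking a partitioner.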
Alternatives
Do Nothing: Leave it as it is and continue with the tightly coupled system, accepting the maintenance and extensibility costs.
Additional Context
Unstructured already has an abstraction layer to some degree in the auto partitioner. However, the documentation advocates using document-specific partitioners for increased speed, fewer dependencies, and additional features.