Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Epic] Extract catalog functionality from the core to make it more modular #10782

Open
alamb opened this issue Jun 3, 2024 · 33 comments
Open
Assignees
Labels
enhancement New feature or request

Comments

@alamb
Copy link
Contributor

alamb commented Jun 3, 2024

Is your feature request related to a problem or challenge?

As @goldmedal started trying to move the DynamicFileProvider so others could use it in #10745 I think it is clear that there is not a good way to add additional catalog support in the core without everything being intertwined.

Thus I think we should try and extract the different catalog providers out of datafusion core so it it easier

Describe the solution you'd like

I suggest the following final layout:

  1. traits like CatalogProvider, SchemaProvider, etc in a new crate datafusion-catalog (since these traits rely on table provider, etc I think this can't be in datafusion-common or datafusion-expr)
  2. The built in Memory* providers are in datafusion-catalog
  3. The bult in InformationSchema providers are in datafusion-catalog
  4. The newly proposed DynamicFileCatalog in datafusion-catalog
  5. (eventually) the LIstingTableProvider (which is by far the most complicated) moved to its own crate datafusion-catalog-listing

Describe alternatives you've considered

No response

Additional context

No response

@alamb alamb added the enhancement New feature or request label Jun 3, 2024
@alamb
Copy link
Contributor Author

alamb commented Jun 3, 2024

If this seems like a reasonable idea to people I will file tickets to break down the work

cc @andygrove @jayzhan211 @comphead @mustafasrepo for your thoughts

@lewiszlw
Copy link
Member

lewiszlw commented Jun 4, 2024

I agree with this direction. But now this seems hard to achieve because SchemaProvider depends on TableProvider and TableProvider depends on SessionState.

@comphead
Copy link
Contributor

comphead commented Jun 4, 2024

Thanks @alamb for starting this discussion.
If I understood correctly there is a use case for using catalog abstractions/implementation but without datafusion core?

Like @lewiszlw correctly mentioned, we got some coupling between providers and the core. I'm just trying to understand the usecase when providers needed without the core

@jayzhan211
Copy link
Contributor

jayzhan211 commented Jun 5, 2024

Is it due to the complexity of ListingTable so it has it's own crate? If they have common things then it is better to organize them into one crate. If ListingTable is so different than others, it is nice to have an independent crate

@alamb
Copy link
Contributor Author

alamb commented Jun 5, 2024

If I understood correctly there is a use case for using catalog abstractions/implementation but without datafusion core?

In my mind the real use usecase is to more easily use datafusion without having to bring in all the dependencies of LIstingTable (like parquet, avro, json, etc)

So the real usecase is getting ListingTable out of the core. But since the catalog API is in the core now there is no way to get ListingTable out of the core without also first moving the catalog API

Is it due to the complexity of ListingTable so it has it's own crate? If they have common things then it is better to organize them into one crate. If ListingTable is so different than others, it is nice to have an independent crate

I think both the complexity of ListingTable but also because if its dependency tree (e.g. parquet-rs and avro and json and object_store and ...)

For use cases like WASM it is quite messy to have the API split up like it currently is

@alamb
Copy link
Contributor Author

alamb commented Jun 5, 2024

I agree with this direction. But now this seems hard to achieve because SchemaProvider depends on TableProvider and TableProvider depends on SessionState.

I agree @lewiszlw -- well put. I made a first PR to start detangling things here: #10794 (it just splits SessionState into its own module)

Longer term we would have to figure out where SessionState would live (it still depends on several things in the core crate like datasource::provider and datasource::function 🤔

Maybe we could look into splitting out datafusion-datasource / datafusion-datasource-parquet / datafusion-datsource-avro, etc -- I don't have time to drive this at the moment but would be interested in helping anyone who did

@comphead
Copy link
Contributor

comphead commented Jun 5, 2024

If I understood correctly there is a use case for using catalog abstractions/implementation but without datafusion core?

In my mind the real use usecase is to more easily use datafusion without having to bring in all the dependencies of LIstingTable (like parquet, avro, json, etc)

So the real usecase is getting ListingTable out of the core. But since the catalog API is in the core now there is no way to get ListingTable out of the core without also first moving the catalog API

Hm... they probably thrive to have their own readers/writes perhaps other than arrow-rs implementation, that makes sense for me. And yes, if DF stands for extensibility we should make this happen. Not sure how difficult that can be though. We probably need to start with replacing core abstractions with traits instead of implementations to decouple it.

@alamb
Copy link
Contributor Author

alamb commented Jun 6, 2024

Hm... they probably thrive to have their own readers/writes perhaps other than arrow-rs implementation, that makes sense for me. And yes, if DF stands for extensibility we should make this happen. Not sure how difficult that can be though. We probably need to start with replacing core abstractions with traits instead of implementations to decouple it.

Yes something like this -- I think most of the traits already exist (e.g. CatalogProvider) but figuring out how to decouple SessionState (which is referred to by CatalogProvider is the trickiest bit I think)

@jayzhan211
Copy link
Contributor

jayzhan211 commented Jul 11, 2024

I'm interesting on how datafusion can support vector index like Duckdb's VSS extension

Given the comment, it seems the issue here might be helpful 🤔 ?
I can work on this 👀

@jayzhan211
Copy link
Contributor

jayzhan211 commented Jul 12, 2024

There is a dead lock that block us from extracting out catalog 😆

  1. To move CatalogProvider out we need to move TableProvider to datafusion-datasource
  2. To move TableProvider out we need to avoid dependency on SessionState for scan function, but there is CatalogProviderList in SessionState

SessionState contains tons of concept that are not all necessary for scan

pub struct SessionState {
    /// A unique UUID that identifies the session
    session_id: String,
    /// Responsible for analyzing and rewrite a logical plan before optimization
    analyzer: Analyzer,
    /// Provides support for customising the SQL planner, e.g. to add support for custom operators like `->>` or `?`
    expr_planners: Vec<Arc<dyn ExprPlanner>>,
    /// Responsible for optimizing a logical plan
    optimizer: Optimizer,
    /// Responsible for optimizing a physical execution plan
    physical_optimizers: PhysicalOptimizer,
    /// Responsible for planning `LogicalPlan`s, and `ExecutionPlan`
    query_planner: Arc<dyn QueryPlanner + Send + Sync>,
    /// Collection of catalogs containing schemas and ultimately TableProviders
    catalog_list: Arc<dyn CatalogProviderList>,
    /// Table Functions
    table_functions: HashMap<String, Arc<TableFunction>>,
    /// Scalar functions that are registered with the context
    scalar_functions: HashMap<String, Arc<ScalarUDF>>,
    /// Aggregate functions registered in the context
    aggregate_functions: HashMap<String, Arc<AggregateUDF>>,
    /// Window functions registered in the context
    window_functions: HashMap<String, Arc<WindowUDF>>,
    /// Deserializer registry for extensions.
    serializer_registry: Arc<dyn SerializerRegistry>,
    /// Holds registered external FileFormat implementations
    file_formats: HashMap<String, Arc<dyn FileFormatFactory>>,
    /// Session configuration
    config: SessionConfig,
    /// Table options
    table_options: TableOptions,
    /// Execution properties
    execution_props: ExecutionProps,
    /// TableProviderFactories for different file formats.
    ///
    /// Maps strings like "JSON" to an instance of  [`TableProviderFactory`]
    ///
    /// This is used to create [`TableProvider`] instances for the
    /// `CREATE EXTERNAL TABLE ... STORED AS <FORMAT>` for custom file
    /// formats other than those built into DataFusion
    ///
    /// [`TableProvider`]: crate::datasource::provider::TableProvider
    table_factories: HashMap<String, Arc<dyn TableProviderFactory>>,
    /// Runtime environment
    runtime_env: Arc<RuntimeEnv>,

    /// [FunctionFactory] to support pluggable user defined function handler.
    ///
    /// It will be invoked on `CREATE FUNCTION` statements.
    /// thus, changing dialect o PostgreSql is required
    function_factory: Option<Arc<dyn FunctionFactory>>,
}

@alamb
Copy link
Contributor Author

alamb commented Jul 12, 2024

To move TableProvider out we need to avoid dependency on SessionState for scan function, but there is CatalogProviderList in SessionState

🤔 we could potentially move SessionState into the catalog crate

In fact the more I think about it, it seems to make some sort of sense to have a crate with the catalog traits / interfaces such as SessionState, Catalog, TableProvider that can be shared broadly.

Then we could make separate crates that actually had the implementations like

And we could put the individual datasources like

  • datafusion-datasource-parquet
  • datafusion-datasource-csv
  • datafusion-datasource-json
  • ...

🤔

I think #11320 and #11403 from @Omega359 makes this easier

@jayzhan211
Copy link
Contributor

jayzhan211 commented Jul 13, 2024

datafusion-catalog: contains traits
datafusion-catalog-basic

datafusion-catalog datafusion-catalog-common: contains traits
datafusion-catalog-basic datafusion-catalog: actual implementation
🤔

Let's try moving SessionState, Catalog, TableProvider together into datafusion-catalog-common or common-runtime

@alamb
Copy link
Contributor Author

alamb commented Jul 15, 2024

Update: #11403 has been merged and I think building SessionState is much nicer now

@jayzhan211
Copy link
Contributor

jayzhan211 commented Jul 16, 2024

The dependencies in core is quite complex, take a note for it

Draft the dependency graph, incomplete

  graph TD;
      CatalogProvider --> TableProvider
      TableProvider --> SessionState
      TableProvider --> ExecutionPlan
      SessionState --> PhysicalOptimizer
      PhysicalOptimizer --> ExecutionPlan
      SessionState --> QueryPlanner
      QueryPlanner --> ExecutionPlan
      QueryPlanner --> SessionState
      SessionState --> CatalogProviderList
     CatalogProviderList --> CatalogProvider
      SessionState --> TableFunction
      SessionState --> FileFormatFactory
      FileFormatFactory --> FileFormat
      FileFormat --> SessionState
      FileFormat --> ExecutionPlan
      SessionState --> SessionConfig
      SessionState --> TableProviderFactory
      TableProviderFactory --> TableProvider
      SessionState --> RuntimeEnv
      RuntimeEnv --> MemoryPool
      RuntimeEnv --> DiskManager
      RuntimeEnv --> CacheManager
      RuntimeEnv --> ObjectStoreRegistry
      SessionState --> FunctionFactory
      FunctionFactory --> TableProvider
      TableFunction --> TableProvider
Loading

Circular found. It means they should be in the same crate

  graph TD;
      CatalogProvider --> TableProvider
      TableProvider --> SessionState
      SessionState --> TableFunction
      TableFunction --> TableProvider
Loading
  graph TD;
      CatalogProvider --> TableProvider
      TableProvider --> SessionState
      SessionState --> CatalogProviderList
      CatalogProviderList --> CatalogProvider
Loading
  graph TD;
      CatalogProvider --> TableProvider
      TableProvider --> SessionState
      SessionState --> FileFormatFactory
      FileFormatFactory --> FileFormat
     FileFormat --> SessionState
Loading
  graph TD;
      CatalogProvider --> TableProvider
      TableProvider --> SessionState
      SessionState --> QueryPlanner
      QueryPlanner --> SessionState
Loading

CatalogProvider + TableProvider + SessionsState + FileFormat + QueryPlanner should be moved to same crate

These 3 has no circular dependencies which mean we can/should move them out of core before moving catalog-common out of core.

  • PhysicalOptimizer
  • RuntimeEnv
  • SessionConfig

@comphead
Copy link
Contributor

This graph is amazing btw, first it shows how tightly coupled we are, the second is it gives really nice understandable picture

@jayzhan211
Copy link
Contributor

jayzhan211 commented Jul 16, 2024

Current dependencies for TableProvider that are used in datafusion

  graph TD;
      CatalogProvider --> TableProvider
      TableProvider --> QueryPlanner
      TableProvider --> SessionConfig
      TableProvider --> ExecutionProps
      TableProvider --> RuntimeEnv
      TableProvider --> FileFormat
      FileFormat --> SessionState
      QueryPlanner --> SessionState
Loading

DataFrameTableProvider: create_physical_plan
MemTable: ExplainOptions + ExplainOptions
ViewTable: create_physical_plan
ListingTable: ExecutionOptions + RuntimeEnv + FileFormat
SimpleCsvTable: create_physical_plan + TaskContext
IndexTableProvider: create_physical_plan

SessionState is mainly used in create_physical_plan, which I guess most of them are configs or functions 🤔
For FileFormat, most of them we need from SessionState are also configs

@alamb
Copy link
Contributor Author

alamb commented Jul 16, 2024

These 3 has no circular dependencies which mean we can/should move them out of core before moving catalog-common out of core.

PhysicalOptimizer

I agree -- I think we can pull the physical optimzier rules into their own crate.

The challenge would be any back dependencies (e.g. if any depend directly on execution plans in the core like ParquetExec.

RuntimeEnv

I think this is already out of core (in datafusion-execution: https://github.com/apache/datafusion/blob/main/datafusion/execution/src/runtime_env.rs

SessionConfig

Likewise, this is already in datafusion-execution and not in core: https://github.com/apache/datafusion/blob/main/datafusion/execution/src/config.rs#L94-L93

@alamb
Copy link
Contributor Author

alamb commented Jul 16, 2024

Shall we file a ticket / epic to pull physical optimizer out?

@findepi
Copy link
Member

findepi commented Jul 17, 2024

newcomer here, so maybe i say something stupid, please bear with me
These circular dependencies seem to revolve around SessionState which is big but also common needed.
What about introducing an interface to that state, separating internal implementation (core) and others can access (API)?

@jayzhan211
Copy link
Contributor

SessionState itself has implement FunctionRegistry, OptimizerConfig, are they similar to your idea?

@jayzhan211
Copy link
Contributor

jayzhan211 commented Jul 17, 2024

I had a proposal in #11420 to minimize dependency with small struct. I think replacing it with another smaller trait are similar idea, both looks good to me

@findepi
Copy link
Member

findepi commented Jul 17, 2024

I don't know what i am doing, but here is what i had in mind: #11516

@jayzhan211
Copy link
Contributor

jayzhan211 commented Jul 17, 2024

If we're planning to replace SessionState, I'd like the minimum dependencies. This helps avoid potential circular dependency issues that might arise from other components, even if they're not necessary.

@findepi
Copy link
Member

findepi commented Aug 20, 2024

5. (eventually) the LIstingTableProvider (which is by far the most complicated) moved to its own crate datafusion-catalog-listing

This will require moving file formats (datafusion/core/src/datasource/file_format) too.
I think this might be a good fit for a yet another top level module. What do you think?

@jayzhan211
Copy link
Contributor

And we could put the individual datasources like
datafusion-datasource-parquet
datafusion-datasource-csv
datafusion-datasource-json
...

We could try to move them out to a crate datasource first and maybe individual data source crate if required

@alamb
Copy link
Contributor Author

alamb commented Aug 20, 2024

We could try to move them out to a crate datasource first and maybe individual data source crate if required

I think this would be ideal

The parquet one in particular is quite large. The others are not as large

@jayzhan211
Copy link
Contributor

@findepi Is there any blocker of moving catalog / datasource out of core?

@findepi
Copy link
Member

findepi commented Sep 2, 2024

@jayzhan211 not aware of any blockers, but not working on this right now, if this is what you're asking about.

@jayzhan211
Copy link
Contributor

jayzhan211 commented Nov 9, 2024

@jonathanc-n
The main idea behind extracting a module is to identify dependencies by looking at which crates are imported.

For example, if we want to move CsvExec out, we’ll also need to move FileScanConfig and FileCompressionType. To extract FileScanConfig, we’ll need to move PartitionedFile as well. This process will reveal that we must move the entire datasource::listing module first, and whenever we need SessionState, replace it with trait Session

@jonathanc-n
Copy link
Contributor

jonathanc-n commented Nov 9, 2024

Oh i see, should we be putting priority on this soon? or should this be visited back another time?

@jayzhan211
Copy link
Contributor

Oh i see, should we be putting priority on this soon? or should this be visited back another time?

I will help to review with this.

@edmondop
Copy link
Contributor

I went through the thread and I couldn't figure out if we had identified/agreed on a set of issues that compose this epic @alamb

@alamb
Copy link
Contributor Author

alamb commented Nov 13, 2024

Oh i see, should we be putting priority on this soon? or should this be visited back another time?

In my opinion, this would qualify as "Things that make DataFusion easier to work with" (details here) and thus I will also try and find time to help this project along (both reviews as well as project planning)

I have been annoyed recently at the amount of time it takes to compile datafusion

For example, if we want to move CsvExec out, we’ll also need to move FileScanConfig and FileCompressionType. To extract FileScanConfig, we’ll need to move PartitionedFile as well. This process will reveal that we must move the entire datasource::listing module first, and whenever we need SessionState, replace it with trait Session

I went through the thread and I couldn't figure out if we had identified/agreed on a set of issues that compose this epic @alamb

Here is what I suggest:

  1. Try and complete moving the items listed in this issues description.
  2. Do NOT try and move ListingTableProider as part of this project (that is big enough that I think we should make its own epic for the reasons @jonathanc-n mentions)

I took a quick look and I think there are two things we could move to datafusion-catalog to complete this ticket:

  1. The Memory* providers: https://github.com/apache/datafusion/blob/4c1ec807fb70ef8abe96299760dd2faaa556a49c/datafusion/core/src/catalog_common/memory.rs
  2. The InformationSchema providers: https://github.com/apache/datafusion/blob/4c1ec807fb70ef8abe96299760dd2faaa556a49c/datafusion/core/src/catalog_common/information_schema.rs

Perhaps we can file tickets for those items and I can link them to this epic?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

7 participants