Add huggingface extension #261

Merged

matthewmturner
Collaborator

No description provided.

@matthewmturner
Collaborator Author

@Xuanwo this is as far as I got today. I'm at the point where I need to figure out how to map Hugging Face to object store semantics. I tried a quick CREATE TABLE statement and ended up with this error. The path looks okay to me, but I'm not that familiar with Hugging Face and didn't get to look into this much.

[Image: screenshot of the error output]

I'll pick back up on this tomorrow, but if you have any insight it would be very helpful.

hf_builder = hf_builder.root(root);
};
if let Some(token) = &huggingface_config.token {
hf_builder = hf_builder.repo_id(token);

I'm guessing this is wrong.
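
The flagged line passes the token to `repo_id`, which is presumably the concern. Below is a minimal sketch of the likely intended mapping. It uses a stand-in builder so it compiles on its own: `HuggingFaceConfig`, `HfBuilder`, and `builder_from_config` are hypothetical names, not opendal's or dft's actual types, and the `token` setter on the real builder is an assumption here.

```rust
/// Stand-in for the extension's config section (field names are assumptions).
struct HuggingFaceConfig {
    repo_id: Option<String>,
    root: Option<String>,
    token: Option<String>,
}

/// Stand-in builder mimicking the chained-setter style of the snippet above.
/// This is NOT opendal's builder type.
#[derive(Debug, Default, PartialEq)]
struct HfBuilder {
    repo_id: Option<String>,
    root: Option<String>,
    token: Option<String>,
}

impl HfBuilder {
    fn repo_id(mut self, v: &str) -> Self {
        self.repo_id = Some(v.to_string());
        self
    }
    fn root(mut self, v: &str) -> Self {
        self.root = Some(v.to_string());
        self
    }
    fn token(mut self, v: &str) -> Self {
        self.token = Some(v.to_string());
        self
    }
}

/// Copy each optional config value into the builder setter of the same name.
fn builder_from_config(config: &HuggingFaceConfig) -> HfBuilder {
    let mut builder = HfBuilder::default();
    if let Some(repo_id) = &config.repo_id {
        builder = builder.repo_id(repo_id);
    }
    if let Some(root) = &config.root {
        builder = builder.root(root);
    }
    if let Some(token) = &config.token {
        // The token goes to the token setter, not to repo_id.
        builder = builder.token(token);
    }
    builder
}

fn main() {
    let config = HuggingFaceConfig {
        repo_id: Some("HuggingFaceTB/finemath".to_string()),
        root: Some("/".to_string()),
        token: Some("hf_example_token".to_string()), // placeholder value
    };
    println!("{:?}", builder_from_config(&config));
}
```

With this shape, a config that sets only a token leaves `repo_id` unset instead of overwriting it.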

@Xuanwo

Xuanwo commented Jan 15, 2025

Hi, given the error output in your posted image, I assume we are trying to access an incorrect path. OpenDAL manages all paths internally, so we only need to provide the path relative to the repository root instead of trying to build the full URL ourselves.


I built a real example with the repo you are using:

use opendal::services::Huggingface;
use opendal::Operator;
use opendal::Result;

#[tokio::main]
async fn main() -> Result<()> {
    // Create Huggingface backend builder
    let builder = Huggingface::default()
        // set the type of Huggingface repository
        .repo_type("dataset")
        // set the id of Huggingface repository
        .repo_id("HuggingFaceTB/finemath")
        // set the revision of Huggingface repository
        .revision("main")
        // set the root for Huggingface, all operations will happen under this root
        .root("/");

    let op: Operator = Operator::new(builder)?.finish();
    let entries = op.list("/").await?;
    println!("{:?}", entries.iter().map(|v| v.path()).collect::<Vec<_>>());

    let meta = op
        .stat("finemath-3plus/train-00000-of-00128.parquet")
        .await?;
    println!("{:?}", meta);

    Ok(())
}

The output will be:

["assets/", "finemath-3plus/", "finemath-4plus/", "infiwebmath-3plus/", "infiwebmath-4plus/", ".gitattributes", "README.md"]
Metadata { mode: FILE, is_current: None, is_deleted: false, cache_control: None, content_disposition: None, content_length: Some(507607173), content_md5: None, content_range: None, content_type: Some("application/json; charset=utf-8"), content_encoding: None, etag: Some("W/\"23e-Dio8lpah4iHFyrzOC8sgQMZGg8E\""), last_modified: Some(2024-12-19T09:49:58Z), version: None, user_metadata: None }

I hope this example effectively demonstrates how to properly configure the huggingface service here.

// I'm not that familiar with Huggingface so I'm not sure what permutations of config
// values are supposed to work.

let mut base_url = String::from("https://huggingface.co/");

I'm guessing we can register the URL as hf://datasets/<repo_id>/ and then access a file as hf://datasets/HuggingFaceTB/finemath/finemath-3plus/train-00000-of-00128.parquet.

There may be some tricky details in the URL handling inside datafusion and object_store.
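
To make the suggested URL layout concrete, here is a minimal sketch of splitting such an `hf://` URL into the values the builder takes (`repo_type`, `repo_id`) plus the path relative to the repository root. `HfLocation` and `parse_hf_url` are hypothetical names used for illustration only. The sketch assumes the repo id is always `<owner>/<name>` and handles only the `datasets` prefix mentioned above; it does not use datafusion or object_store.

```rust
/// Pieces of a Hugging Face location, split out of an `hf://` URL.
#[derive(Debug, PartialEq)]
struct HfLocation {
    repo_type: String, // value for the builder's `repo_type`, e.g. "dataset"
    repo_id: String,   // "<owner>/<name>"
    path: String,      // path relative to the repository root (may be empty)
}

/// Parse `hf://datasets/<owner>/<name>/<path...>`.
/// Returns `None` for anything that does not match that shape.
fn parse_hf_url(url: &str) -> Option<HfLocation> {
    let rest = url.strip_prefix("hf://")?;
    // Split into at most four parts so the file path keeps its own slashes.
    let mut parts = rest.splitn(4, '/');
    let repo_type = match parts.next()? {
        "datasets" => "dataset",
        _ => return None, // other repo kinds are out of scope for this sketch
    };
    let owner = parts.next().filter(|s| !s.is_empty())?;
    let name = parts.next().filter(|s| !s.is_empty())?;
    let path = parts.next().unwrap_or("");
    Some(HfLocation {
        repo_type: repo_type.to_string(),
        repo_id: format!("{owner}/{name}"),
        path: path.to_string(),
    })
}

fn main() {
    let url = "hf://datasets/HuggingFaceTB/finemath/finemath-3plus/train-00000-of-00128.parquet";
    let loc = parse_hf_url(url).expect("URL should match the assumed layout");
    // The relative path is what would be handed to the operator,
    // e.g. to `stat` in the example earlier in this thread.
    println!("{:?}", loc);
}
```

The point of the split is that only the trailing path is passed to OpenDAL, while the repo type and id go into the builder once, which matches the advice above to use paths relative to the repository root.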

@matthewmturner
Collaborator Author

@Xuanwo thanks much for the feedback and apologies for the delay getting back - currently on vacation and not online as much.

For your information, I ended up creating a separate repo to test out opendal with datafusion and get a minimal working example independent of dft. Once I have it working there, I'll finish this branch.

@Xuanwo

Xuanwo commented Jan 21, 2025

> For your information, I ended up creating a separate repo to test out opendal with datafusion and get a minimal working example independent of dft. Once I have it working there, I'll finish this branch.

Hi, we have a great example for this: https://github.com/apache/opendal/blob/main/integrations/object_store/examples/datafusion.rs

@matthewmturner matthewmturner merged commit a787e0c into datafusion-contrib:main Jan 24, 2025
9 checks passed
@matthewmturner
Collaborator Author

@Xuanwo got it working :) thanks for your help

@Xuanwo

Xuanwo commented Jan 24, 2025

> @Xuanwo got it working :) thanks for your help

Nice!
