You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
I am working on a dataset with many files (over half a million) and with an intricate file structure. I don't need to use all the files at once. In fact, in different phases of the project, I will need to select different subsets of files. For now, I am using information encoded in the file names or in the file path (like the subdirectory under which a file lives) to decide which files to use, but, ideally, I am looking forward to fast and flexible ways to interact with DataSets.jl datasets.
I have created the following toy TOML-embedded dataset to illustrate the case:
Imagine I need the files such that the string "b.file" is in the file path.
I can take advantage of the abstract tree interface and iterate over Leaves, but this turns out to be very slow.
using DataSets
using AbstractTrees
ds =open(dataset("embedded_letters"))
for file inLeaves(ds)
ifoccursin("b.fil", string(file.path))
println(file.path)
endend> letters/b/b.file
> letters/b/bb/bbb/bbb.file
> letters/b/bb/bbb/bbb.filo
I thought of an alternative approach using the FileTrees.jl package. Assuming that I can reproduce the file structure of embedded_letters as a FileTrees.jl tree, then I find that filtering files is much faster. (Here, I create the tree manually.)
using FileTrees
using FilePathsBase
ft =maketree(
"letters"=> [
"a"=> ["aa"=> ["aa.file"]],
"b"=> ["b.file", "bb"=> ["bbb"=> ["bbb.file", "bbb.filo"]]],
],
)
ft_filtered =filter(x ->occursin("b.fil", x.name), ft, dirs =false);
for file infiles(ft_filtered)
println(Path(file))
end> letters/b/b.file
> letters/b/bb/bbb/bbb.file
> letters/b/bb/bbb/bbb.filo
I am considering to do the filtering of files using FileTrees.jl, prepare a list of paths as in the example, and finally visit the selected paths in the DataSets.jl dataset. Of course, it would be much better to improve on the DataSets.jl side so that I can avoid the detour.
The text was updated successfully, but these errors were encountered:
So operations such as listing all children etc. have to do some amount of non-trivial work to figure out e.g. the paths of all children, and then construct the BlobTree objects. And every time you do the same operation, you kinda have to re-do the work. FileTrees, on the other hand, basically just has pointers to parent and child nodes, which makes is very cheap to do the same operation.
I feel that the (non-breaking) path forward here would be to add a new representation of a BlobTree, that would use the pointer/linked-list-y approach.
I am working on a dataset with many files (over half a million) and with an intricate file structure. I don't need to use all the files at once. In fact, in different phases of the project, I will need to select different subsets of files. For now, I am using information encoded in the file names or in the file path (like the subdirectory under which a file lives) to decide which files to use, but, ideally, I am looking forward to fast and flexible ways to interact with DataSets.jl datasets.
I have created the following toy TOML-embedded dataset to illustrate the case:
Imagine I need the files such that the string "b.file" is in the file path.
I can take advantage of the abstract tree interface and iterate over Leaves, but this turns out to be very slow.
I thought of an alternative approach using the FileTrees.jl package. Assuming that I can reproduce the file structure of
embedded_letters
as a FileTrees.jl tree, then I find that filtering files is much faster. (Here, I create the tree manually.)I am considering to do the filtering of files using FileTrees.jl, prepare a list of paths as in the example, and finally visit the selected paths in the DataSets.jl dataset. Of course, it would be much better to improve on the DataSets.jl side so that I can avoid the detour.
The text was updated successfully, but these errors were encountered: