Expose File.stat and improve Path.load #339

haesbaert · 2022-10-05T15:44:51Z

Path.load was doubling the buffer and trying to fit enough until it saw an EOF. This would cause many unecessary large allocations and my poor 16GB ram laptop couldn't even load a 5GB file without getting OOMed.

This diff makes load create an initial buffer of the size of the file and load exactly that amount of bytes in it.

Also, using take_all here isn't a good idea since it will resize even if initial_size (capacity) is adequate, that's because it does:
ensure t.len(0) (0+1), which ends up allocating another bigarray.

talex5

Thanks, this does need improving.

I'm not sure size is the right method, though. Perhaps we need a more general stat operation?

The Linux statx system call takes a mask saying what information you want. Maybe we could do something like that?

lib_eio/path.ml

lib_eio/path.mli

lib_eio_linux/eio_linux.ml

haesbaert · 2022-10-05T18:58:25Z

Let me know what you think, I've added two versions of Buf_read.load, the problem with using read_exact is that if the file shrinks mid-read we just throw an EOF, the second issue is that due to the integer overflow in luv, read_exact would end up always giving an EOF.
So I've added both versions, one that copes with getting an early EOF, and one that doesn't.
I've fixed the Uring fstat to use the Large version as well.
If you don't want to expose size as you mentioned before, we have to decide what to put and how to export stat, since we would need to see what Luv offers in windows and so on.

talex5

the problem with using read_exact is that if the file shrinks mid-read we just throw an EOF

OK, that seems like a good reason to use the Buf_read!

the second issue is that due to the integer overflow in luv, read_exact would end up always giving an EOF.

If the low-level read operation doesn't work correctly, then neither will Buf_read built on top of it. Eio_luv should just limit the buffer size to something it can handle. Callers need to handle short reads anyway (and I doubt to OS is going to write more than 2GB in one go anyway).

I've fixed the Uring fstat to use the Large version as well.

Thanks!

If you don't want to expose size as you mentioned before, we have to decide what to put and how to export stat, since we would need to see what Luv offers in windows and so on.

Yes, it needs a bit of thinking about. But I'd prefer to avoid adding a new method that everyone has to implement, when we know we're going to get rid of it again soon.

lib_eio/path.ml

lib_eio_linux/eio_linux.ml

haesbaert · 2022-10-12T22:41:30Z

This is ready for review, let me know what you think.

talex5 · 2022-10-14T09:52:56Z

lib_eio/path.ml

-  Buf_read.take_all buf
+let load ((t:#Fs.dir), path) =
+  let st = t#stat path in
+  let mymax = min Sys.max_string_length Sys.max_array_length in


Why max_array_length?

Because we have to store it in a Cstruct.t before the converting it to a string, and max_array_length < max_string_length, at least on amd64.

As far as I know, that doesn't apply to bigarrays like cstruct.

Also, it occurs to me that, since we're opening the file anyway, it would be simpler to use File.stat here, not Path.stat.

I did check and bigarray is constrained by the that limit, but I might have been mistaken I'll check again.

I moved it to File, problem is I can't call it a unix domain socket, I've added the method to File.ro, to have it in socket it would have to be deeper, I'm not sure what is preferred here ?

I did check and bigarray is constrained by the that limit, but I might have been mistaken I'll check again.

Could you post a link to the code. It seems very surprising to me that a "big array" is the same size as a normal one.

In particular, https://v2.ocaml.org/api/Bigarray.html says:

Bigarrays are not limited in size, unlike OCaml arrays.

I did check and bigarray is constrained by the that limit, but I might have been mistaken I'll check again.

Could you post a link to the code. It seems very surprising to me that a "big array" is the same size as a normal one.

In particular, https://v2.ocaml.org/api/Bigarray.html says:

Bigarrays are not limited in size, unlike OCaml arrays.

You're absolutely correct, I think my confusion came from the fact that Bigarray was using Array underneath and I assumed the same, but it's not:

caml_make_vect (Array.create) checks for Max_wosize where Bigarray's caml_ba_create does not. I've removed the test.

lib_eio/path.ml

lib_eio/fs.ml

lib_eio/path.ml

lib_eio_linux/eio_linux.ml

lib_eio_luv/eio_luv.ml

talex5 · 2022-10-14T10:40:21Z

tests/fs.md

@@ -495,3 +499,25 @@ Unconfined:
 +rename <fs:../foo> to <fs:foo> -> ok
 - : unit = ()
 ```
+
+# Stat


With make test_luv, I get:

+|Exception: Eio.Fs.Not_found ("stat_link", _)

oopsy, will fix it.

heh, File.realpath resolves the symlink and returns ENOENT, fix is not trivial, but since we don't do it in u-ring and therefore behaves differently, this is a bug. We're not using realpath(3) in Uring.

I guess we should just do the realpath on the parent, then? It's fine to lstat a symlink even if we don't have access to its target.

Oh that's smart, I hadn't considered it.

I won't address the realpath on the parent here, it's outside of the scope imho.

Yes, now Path.stat is gone - we can worry about that later :-) PR title needs updating too...

haesbaert · 2022-10-18T17:02:17Z

This is a bit in WIP, I think I've addressed all points but I'm not too sure about File.stat

avsm · 2022-10-21T16:32:08Z

lib_eio/file.ml

+(** Tranditional Unix permissions. *)
+module Unix_perm = struct
+  type t = int
+  (** This is the same as {!Unix.file_perm}, but avoids a dependency on [Unix]. *)


i'm not so keen on this in the portable eio core, as we may just have very simple filesystem backends. It would be nice to not have to dummy out Stat.t with fake values for uid/gid, for example.

But this was in Path, Path is already part of the portable core, or did I misunderstand something ?

I'm unsure where to put the Stat, I was thinking on adding a new effect on Eio_unix.Private:
| Stat : <unix_fd> -> Stat.t Effect.t

Does that make sense, technically all we need is a fd, or should I add to Eio_unix.unix_fd directly ?

There's no problem with having the concept of "Unix permission" itself in the core - even non-Unix systems will often want to access Unix filesystems, tarballs, etc.

A more interesting question is whether we should add type parameters to dir to cover different types of file-system. e.g. our dir could become Unix_perm.t dir. However, that could be annoying, and any FS that can be mounted on Linux already has some suitable mapping.

Linux's statx(2) already allows you to do a stat that might or might not return some values (e.g. you can say you don't care about the UID and it is free to leave that field blank). So ideally we want a way to indicate which fields are provided even there.

haesbaert · 2022-10-27T10:49:27Z

Had to step back a bit on this diff, I've rebased and force-pushed, from the new commit message:
I apologize for wasting everyone's time and radically changing it, but the whole luv bits were too insane for my taste, see details below.
I might be doing the whole interface/OO thing wrong and would appreciate some pointers here.

    Expose `stat` and improve Path.load

    `stat` can be called on files and flows in order to discover what is the backing
    descriptor. Some notes:
      - Use Unix.Largestat instead of Unix.stat
      - Add a portable stat for the user
      - Moved Path.Unix_perm into File.Unix_perm to avoid cyclical ref

    A previous attempt of this diff tried to make proper use of Luv.File.Stat but
    there were multiple issues: `mode` was not properly exported so I had to add a
    discovery.ml; it only works for a subset of file types and we had to manually
    convert timestamps to float; all doable but an insufferable amount of code to end
    up with less functionality than just calling into Unix.Largestat.stat.

    There are still some kinks:
      - Luv will always follow a link, so we can't really stat on a symbolic link,
        this is because we rely on realpath(3) which resolves the link.
      - We can't open named pipes in with luv atm, so we also can't stat it (#357).
      - I did test all file-modes, but didn't add all to the test since they are
        impractical (opening /dev/tty for Character_special...)

talex5

I apologize for wasting everyone's time and radically changing it, but the whole luv bits were too insane for my taste, see details below.

Sounds good. Do we really need to add it to the socket types, though? It doesn't seem useful for anything.

haesbaert · 2022-10-27T11:13:25Z

I apologize for wasting everyone's time and radically changing it, but the whole luv bits were too insane for my taste, see details below.

Sounds good. Do we really need to add it to the socket types, though? It doesn't seem useful for anything.

Frankly, can't see much uses, and can only remember one time I relied on it, I've added it more for the sake of correctness/completeness, and to use it as an excuse to learn some ocaml OO. I can gut it if it's the case.

talex5 · 2022-10-27T11:25:55Z

Frankly, can't see much uses, and can only remember one time I relied on it, I've added it more for the sake of correctness/completeness, and to use it as an excuse to learn some ocaml OO. I can gut it if it's the case.

Yes please. It will be a pain for e.g. Mirage having to support fstat.

haesbaert · 2022-10-27T15:11:30Z

Frankly, can't see much uses, and can only remember one time I relied on it, I've added it more for the sake of correctness/completeness, and to use it as an excuse to learn some ocaml OO. I can gut it if it's the case.

Yes please. It will be a pain for e.g. Mirage having to support fstat.

Ack, leme know if this is ok and then I'll squash and merge.

lib_eio/file.ml

lib_eio/fs.ml

talex5 · 2022-10-31T10:51:50Z

lib_eio/path.ml

+let load ((t:#Fs.dir), path) =
+  with_open_in (t, path) @@ fun flow ->


Suggested change

let load ((t:#Fs.dir), path) =

with_open_in (t, path) @@ fun flow ->

let load t =

with_open_in t @@ fun flow ->

This isn't resolved. Keeping it as one will also fix the problem that the string in the error loses the dir tag, since you can then use Path.pp.

(we might need a bit of variable renaming in this module; lots of functions here are using t for the directory and path for a full path, which is now a bit backwards)

I'm a bit confused, we don't really include the dir in the error string for any of the Eio.Fs exceptions:

let open_in ~sw ((t:#Fs.dir), path) = t#open_in ~sw path method open_in ~sw path = let fd = File.open_ ~sw (self#resolve path) [`NOFOLLOW; `RDONLY] |> or_raise_path path in (flow fd :> <Eio.File.ro; Eio.Flow.close>) ^^^^

utop # let _ = Eio_luv.run (fun env -> let cwd = Eio.Stdenv.cwd env in Eio.Switch.run @@ fun sw -> Eio.Path.load Eio.Path.(cwd / "not_there"));; Exception: Eio__Fs.Not_found ("not_there", Eio_luv.Luv_error 877502405). let _ = Eio_linux.run (fun env -> let cwd = Eio.Stdenv.cwd env in Eio.Switch.run @@ fun sw -> Eio.Path.load Eio.Path.(cwd / "not_there"));; Exception: Eio__Fs.Not_found ("not_there", Unix.Unix_error (Unix.ENOENT, "openat2", "")).

OK, let's sort that out later then! It just seemed a shame to be splitting the path up only for the purpose of producing a slightly less informative error message.

lib_eio_linux/eio_linux.ml

lib_eio_linux/tests/eurcp_lib.ml

talex5 · 2022-10-31T10:59:14Z

tests/fs.md

@@ -495,3 +499,25 @@ Unconfined:
 +rename <fs:../foo> to <fs:foo> -> ok
 - : unit = ()
 ```
+
+# Stat


I guess we should just do the realpath on the parent, then? It's fine to lstat a symlink even if we don't have access to its target.

talex5

This looks pretty good now! (Would be nice to get the dir tag in the error message though). Can be merged once rebased / squashed.

talex5 · 2022-11-03T12:14:01Z

tests/fs.md

@@ -495,3 +499,25 @@ Unconfined:
 +rename <fs:../foo> to <fs:foo> -> ok
 - : unit = ()
 ```
+
+# Stat


Yes, now Path.stat is gone - we can worry about that later :-) PR title needs updating too...

Some notes: - Use Unix.Largestat instead of Unix.stat - Add a portable stat for the user - Moved Path.Unix_perm into File.Unix_perm to avoid cyclical ref A previous attempt of this diff tried to make proper use of Luv.File.Stat but there were multiple issues: `mode` was not properly exported so I had to add a discovery.ml; it only works for a subset of file types and we had to manually convert timestamps to float; all doable but an insuferable amount of code to end up with less functionality than just calling into Unix.Largestat.stat. There are still some kinks: - Luv will always follow a link, so we can't really stat on a symbolic link, this is because we rely on realpath(3) which resolves the link. - We can't open named pipes in with luv atm, so we also can't stat it (ocaml-multicore#357). - I did test all filemodes, but didn't add all to the test since they are impractical (opening /dev/tty for Character_special...) -- Path.load was doubling the buffer and trying to fit enough until it saw an EOF. This would cause many unecessary large allocations and my poor 16GB ram laptop couldn't even load a 5GB file without getting OOMed. This diff makes load create an initial buffer of the size of the file and load exactly that amount of bytes in it.

@talex5

CHANGES: API changes: - Unify IO errors as `Eio.Io` (@talex5 ocaml-multicore/eio#378). This makes it easy to catch and log all IO errors if desired. The exception payload gives the type and can be used for matching specific errors. It also allows attaching extra information to exceptions, and various functions were updated to do this. - Add `Time.Mono` for monotonic clocks (@bikallem @talex5 ocaml-multicore/eio#338). Using the system clock for timeouts, etc can fail if the system time is changed during the wait. - Allow datagram sockets to be created without a source address (@bikallem @haesbaert ocaml-multicore/eio#360). The kernel will allocate an address in this case. You can also now control the `reuse_addr` and `reuse_port` options. - Add `File.stat` and improve `Path.load` (@haesbaert @talex5 ocaml-multicore/eio#339). `Path.load` now uses the file size as the initial buffer size. - Add `Eio_unix.pipe` (@patricoferris ocaml-multicore/eio#350). This replaces `Eio_linux.pipe`. - Avoid short reads from `getrandom(2)` (@haesbaert ocaml-multicore/eio#344). Guards against buggy user code that might not handle this correctly. - Rename `Flow.read` to `Flow.single_read` (@talex5 ocaml-multicore/eio#353). This is a low-level function and it is easy to use it incorrectly by ignoring the possibility of short reads. Bug fixes: - Eio_luv: Fix non-tail-recursive continue (@talex5 ocaml-multicore/eio#378). Affects the `Socket_of_fd` and `Socketpair` effects. - Eio_linux: UDP sockets were not created close-on-exec (@talex5 ocaml-multicore/eio#360). - Eio_linux: work around io_uring non-blocking bug (@haesbaert ocaml-multicore/eio#327 ocaml-multicore/eio#355). The proper fix should be in Linux 6.1. - `Eio_mock.Backend`: preserve backtraces from `main` (@talex5 ocaml-multicore/eio#349). - Don't lose backtrace in `Switch.run_internal` (@talex5 ocaml-multicore/eio#369). Documentation: - Use a proper HTTP response in the README example (@talex5 ocaml-multicore/eio#377). - Document that read_dir excludes "." and ".." (@talex5 ocaml-multicore/eio#379). - Warn about both operations succeeding in `Fiber.first` (@talex5 ocaml-multicore/eio#358, reported by @iitalics). - Update README for OCaml 5.0.0~beta2 (@talex5 ocaml-multicore/eio#375). Backend-specific changes: - Eio_luv: add low-level process support (@patricoferris ocaml-multicore/eio#359). A future release will add Eio_linux support and a cross-platform API for this. - Expose `Eio_luv.Low_level.Stream.write` (@patricoferris ocaml-multicore/eio#359). - Expose `Eio_luv.Low_level.get_loop` (@talex5 ocaml-multicore/eio#371). This is needed if you want to create resources directly and then use them with Eio_luv. - `Eio_linux.Low_level.openfile` is gone (@talex5 ocaml-multicore/eio#378). It was just left-over test code.

talex5 reviewed Oct 5, 2022

View reviewed changes

lib_eio/path.ml Outdated Show resolved Hide resolved

lib_eio/path.mli Outdated Show resolved Hide resolved

lib_eio_linux/eio_linux.ml Outdated Show resolved Hide resolved

talex5 reviewed Oct 6, 2022

View reviewed changes

lib_eio/path.ml Outdated Show resolved Hide resolved

lib_eio_linux/eio_linux.ml Outdated Show resolved Hide resolved

haesbaert force-pushed the initialsize branch from e2f38d3 to 746db28 Compare October 11, 2022 21:01

haesbaert force-pushed the initialsize branch from 746db28 to 1b6754e Compare October 13, 2022 01:06

talex5 reviewed Oct 14, 2022

View reviewed changes

haesbaert mentioned this pull request Oct 18, 2022

Expose mode from Luv.File.Stat.t as an integral or expose equality aantron/luv#134

Open

haesbaert marked this pull request as draft October 20, 2022 19:53

avsm reviewed Oct 21, 2022

View reviewed changes

haesbaert force-pushed the initialsize branch from 57fc678 to f487afd Compare October 27, 2022 10:46

haesbaert marked this pull request as ready for review October 27, 2022 11:06

talex5 reviewed Oct 27, 2022

View reviewed changes

talex5 reviewed Oct 31, 2022

View reviewed changes

haesbaert marked this pull request as draft October 31, 2022 19:32

haesbaert marked this pull request as ready for review October 31, 2022 20:51

talex5 approved these changes Nov 3, 2022

View reviewed changes

haesbaert changed the title ~~Expose Path.size and fix Path.load~~ Expose Path.stat and fix Path.load Nov 3, 2022

haesbaert changed the title ~~Expose Path.stat and fix Path.load~~ Expose File.stat and fix Path.load Nov 3, 2022

haesbaert changed the title ~~Expose File.stat and fix Path.load~~ Expose File.stat and improve Path.load Nov 3, 2022

haesbaert force-pushed the initialsize branch from 74280aa to b7f87f1 Compare November 3, 2022 15:48

haesbaert merged commit 5c1eae4 into ocaml-multicore:main Nov 3, 2022

talex5 mentioned this pull request Dec 7, 2022

[new release] eio_main, eio_luv, eio_linux and eio (0.7) ocaml/opam-repository#22619

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Expose File.stat and improve Path.load #339

Expose File.stat and improve Path.load #339

haesbaert commented Oct 5, 2022

talex5 left a comment

haesbaert commented Oct 5, 2022 •

edited

Loading

talex5 left a comment

haesbaert commented Oct 12, 2022

talex5 Oct 14, 2022

haesbaert Oct 14, 2022

talex5 Oct 15, 2022

talex5 Oct 15, 2022

haesbaert Oct 18, 2022 •

edited

Loading

talex5 Oct 31, 2022

haesbaert Oct 31, 2022

talex5 Oct 14, 2022

haesbaert Oct 14, 2022

haesbaert Oct 14, 2022

talex5 Oct 31, 2022

haesbaert Oct 31, 2022

haesbaert Oct 31, 2022

talex5 Nov 3, 2022

haesbaert commented Oct 18, 2022

avsm Oct 21, 2022

haesbaert Oct 24, 2022

talex5 Oct 24, 2022

haesbaert commented Oct 27, 2022 •

edited

Loading

talex5 left a comment

haesbaert commented Oct 27, 2022

talex5 commented Oct 27, 2022

haesbaert commented Oct 27, 2022

talex5 Oct 31, 2022

talex5 Nov 3, 2022

haesbaert Nov 3, 2022

talex5 Nov 3, 2022

talex5 Oct 31, 2022

talex5 left a comment

talex5 Nov 3, 2022

		let load ((t:#Fs.dir), path) =
		with_open_in (t, path) @@ fun flow ->

Expose File.stat and improve Path.load #339

Expose File.stat and improve Path.load #339

Conversation

haesbaert commented Oct 5, 2022

talex5 left a comment

Choose a reason for hiding this comment

haesbaert commented Oct 5, 2022 • edited Loading

talex5 left a comment

Choose a reason for hiding this comment

haesbaert commented Oct 12, 2022

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

haesbaert Oct 18, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

haesbaert commented Oct 18, 2022

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

haesbaert commented Oct 27, 2022 • edited Loading

talex5 left a comment

Choose a reason for hiding this comment

haesbaert commented Oct 27, 2022

talex5 commented Oct 27, 2022

haesbaert commented Oct 27, 2022

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

talex5 left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

haesbaert commented Oct 5, 2022 •

edited

Loading

haesbaert Oct 18, 2022 •

edited

Loading

haesbaert commented Oct 27, 2022 •

edited

Loading