-
Notifications
You must be signed in to change notification settings - Fork 36
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
list_objects_v2()
is much slower than it should be
#619
Comments
Will have a look to see if there is any quick wins with thjs |
If you have some idea in how to improve the speed, please feel free to raise a PR |
A bit tedious but much faster parsing is possible if you know the structure of the xml, e.g. for listing objects you can do something like contents <- xml2::xml_find_all(resp_content, "d1:Contents")
xml2::xml_find_all(contents, "d1:Key") |> xml2::xml_text()
xml2::xml_find_all(contents, "d1:LastModified") |> xml2::xml_text()
xml2::xml_find_all(contents, "d1:ETag") |> xml2::xml_text()
xml2::xml_find_all(contents, "d1:ChecksumAlgorithm") |> xml2::xml_text()
xml2::xml_find_all(contents, "d1:Size") |> xml2::xml_integer()
owner <- xml2::xml_find_all(contents, "d1:Owner")
owner |>
xml2::xml_find_all("d1:ID") |>
xml2::xml_text()
owner |>
xml2::xml_find_all("d1:DisplayName") |>
xml2::xml_text()
xml2::xml_find_all(contents, "d1:StorageClass") |> xml2::xml_text() |
I wonder if there is away to make this more generic 🤔 As this function parses all responses. |
Yes, this can be done more generic. I hacked something together: decode_xml <- function(xml, interface) {
if (!is.list(interface)) {
out <- parse_xml_elt(xml, interface)
return(out)
}
nms <- names(interface)
out <- list()
for (i in seq_along(interface)) {
key <- paste0("d1:", names(interface)[[i]])
interface_i <- interface[[i]]
type <- attr(interface_i, "tags")$type
flattened <- attr(interface_i, "tags")$flattened
elts <- xml2::xml_find_all(xml, key, flatten = FALSE)
if (inherits(xml, "xml_nodeset")) {
idx <- lengths(elts) == 0
parsed_elts <- parse_xml_elt(elts, interface_i, idx)
} else {
parsed_elts <- parse_xml_elt(elts, interface_i)
}
if (isTRUE(flattened)) {
n_i <- length(parsed_elts)
parsed_elts <- setNames(parsed_elts, rep(nms[[i]], n_i))
} else {
parsed_elts <- setNames(list(parsed_elts), nms[[i]])
}
out <- c(out, parsed_elts)
}
out
}
parse_xml_elt <- function(xml, interface, idx = NULL) {
n <- length(xml)
if (!is.null(idx) && n > 0) {
xml <- structure(unlist(recursive = FALSE, xml), class = "xml_nodeset")
}
tags <- attr(interface, "tags")
type <- tags$type
flattened <- tags$flattened
if (type == "structure") {
val <- decode_xml(xml, interface[[1]])
default <- xml2::as_xml_document(list())
default <- decode_xml(default, interface[[1]])
} else if (type == "list") {
val <- decode_xml(xml, interface[[1]])
default <- xml2::as_xml_document(list())
default <- decode_xml(default, interface[[1]])
} else if (type == "boolean") {
val <- xml2::xml_text(xml) == "true"
default <- NA
} else if (type == "string") {
val <- xml2::xml_text(xml)
default <- NA_character_
} else if (type == "timestamp") {
val <- xml2::xml_text(xml)
val <- strptime(val, format = "%Y-%m-%dT%H:%M:%S")
val <- as.POSIXct(val)
default <- strptime(NA, format = "%Y-%m-%dT%H:%M:%S")
default <- as.POSIXct(default)
} else if (type == "integer") {
val <- xml2::xml_integer(xml)
default <- NA_integer_
} else {
browser()
}
if (isTRUE(flattened)) {
out <- purrr::transpose(val)
} else {
# `unlist()` might drop elements which we need to fill with the missing value
# this would be a bit easier with `vctrs::vec_init()`
out <- rep(default, n)
if (is.null(idx)) {
idx <- FALSE
}
out[!idx] <- val
}
out
} The structure is similar but not quite the same as before (e.g. it skips some nesting that's not in the interface). It would also be possible (and probably be nicer) to return a data frame instead of flattening a list, e.g. the |
@mgirlich thanks for this, will need to benchmark it. Is it possible to do this without adding any extra dependencies? I would like to keep the dependencies low if possible |
It can be done without purrr. But before doing that we should:
|
We should keep it to a list as we want to keep the api consistent with prior versions and mimic the other aws sdks 😄 |
I did a check of paws performance and interesting I am getting fairly decent results: library(paws)
svc <- s3()
profvis::profvis({
resp <- svc$list_objects_v2(Bucket = "my-dummy-bucket")
}) Note: this returned the maximum number of items for the AWS api call (1000).
@mgirlich can you share the R session environment? Would be interesting to see what is causing this bottle neck. |
The difference in timing is because I'm listing many more than 1.000 objects. But still the vast majority of time is spent in |
Keeping open until paws.common released to cran |
Closing a paws.common 0.6.0 has been release onto the cran |
Most of the time is spend parsing the xml output. It would be great to speed this up.
The text was updated successfully, but these errors were encountered: