export_gtfs() #10

dhersz · 2021-03-16T22:10:24Z

dhersz
Mar 16, 2021
Maintainer

Hello folks,

As of 6185db1, a first version of export_gtfs() is up and running. Here is how it works right now. Much improvement is surely yet to come.

Basic usage

Just input a GTFS object and the path where it should be written. By default, it writes every element inside it:

library(gtfsio)

gtfs_path <- system.file("extdata/ggl_gtfs.zip", package = "gtfsio")
gtfs <- import_gtfs(gtfs_path)

tmpf <- tempfile("gtfs", fileext = ".zip")

export_gtfs(gtfs, tmpf)
zip::zip_list(tmpf)$filename
#>  [1] "calendar_dates.txt"  "fare_attributes.txt" "fare_rules.txt"     
#>  [4] "feed_info.txt"       "frequencies.txt"     "levels.txt"         
#>  [7] "pathways.txt"        "routes.txt"          "shapes.txt"         
#> [10] "stop_times.txt"      "stops.txt"           "transfers.txt"      
#> [13] "translations.txt"    "trips.txt"           "agency.txt"         
#> [16] "attributions.txt"    "calendar.txt"

But you can control which files are written to disk with the files argument:

export_gtfs(gtfs, tmpf, files = c("shapes", "trips"))
zip::zip_list(tmpf)$filename
#> [1] "shapes.txt" "trips.txt"

If an element named . is present (which {tidytransit} uses to hold "auxiliary" tables), it is not exported:

gtfs$. <- list(
  aux_table = data.table::data.table(column = 1:5)
)

export_gtfs(gtfs, tmpf)
zip::zip_list(tmpf)$filename
#>  [1] "calendar_dates.txt"  "fare_attributes.txt" "fare_rules.txt"     
#>  [4] "feed_info.txt"       "frequencies.txt"     "levels.txt"         
#>  [7] "pathways.txt"        "routes.txt"          "shapes.txt"         
#> [10] "stop_times.txt"      "stops.txt"           "transfers.txt"      
#> [13] "translations.txt"    "trips.txt"           "agency.txt"         
#> [16] "attributions.txt"    "calendar.txt"

You can use the overwrite argument to control whether existing files should be overwritten or not:

export_gtfs(gtfs, tmpf, overwrite = FALSE)
#> Error in export_gtfs(gtfs, tmpf, overwrite = FALSE): The file pointed by 'path' exists, but 'overwrite' is set to FALSE.

And trying to export an element that doesn't exist results in an error:

export_gtfs(gtfs, tmpf, files = "ola")
#> Error in export_gtfs(gtfs, tmpf, files = "ola"): The provided GTFS object does not contain the following elements specified in 'files': 'ola'

Notes

I see that tidytransit::write_gtfs() has a few other parameters that export_gtfs() does not include (compression_level and as_dir).

I assumed that using the highest compression level would be desirable, but I'm happy to change it if you think otherwise (I'm not sure how much it affects performance, to be honest). Regarding as_dir, if I read the code correctly it creates a directory instead of a .zip file, right? I'm not sure I like it, but I'd like to hear your opinion on it.

Basic behaviour and handling auxiliary columns

No conversions are made inside export_gtfs() (i.e. the function expects a GTFS object that follows the standards). I figured that, since each one of our packages might handle some columns differently, especially those that are date and time related, it would be better to leave any conversions to be made outside the function. The workflow I'm thinking of could be very roughly translated to something like:

gtfstools_write_gtfs <- function(gtfs, path) {
  # 'gtfs' is not necessarily formatted according to the standards
  # ...
  # each one of our packages would have a function similar to the
  # (hypothetical) convert_to_standards()
  gtfs <- convert_to_standards(gtfs)
  gtfsio::export_gtfs(gtfs, path)
  # ...
}

The workflow above makes sure that fields are correctly formatted in the final .zip file. But one issue remains unsolved:

How do we deal with auxiliary columns?

Right now export_gtfs() deals with auxiliary tables exactly like {tidytransit} does: elements inside the . sub-list are not written to disk, while any data frames located outside . are still exported, even if they are not specified in the official reference. This is very useful when dealing with non-standard GTFS (I've never used them much, but a few extensions are built on top of the official GTFS format).

The problem arises when we're dealing with auxiliary columns (e.g. arrival_time_hms and departure_time_hms created by tidytransit::set_hms_times() in stop_times). These columns should not be exported, but how do we differentiate them from extra columns that must be? @polettif suggested in an earlier discussion using a naming convention here, and I think it's a great idea. We could perhaps use a prefix (e.g. aux, resulting in column names such as aux_arrival_time_hms) that would signal that a column is auxiliary and thus should not be exported.
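To make the prefix idea concrete, here is a minimal sketch in base R. The helper name drop_aux_columns() and the "aux_" prefix are assumptions used only for illustration; neither exists in gtfsio. It also assumes the tables are plain data frames (data.table tables would need data.table's own column-selection syntax).

```r
# Hypothetical helper (not part of gtfsio): drop every column whose name
# starts with `prefix` from each table of a GTFS-like list.
# Assumes plain data.frames.
drop_aux_columns <- function(gtfs, prefix = "aux_") {
  lapply(gtfs, function(tbl) {
    if (!is.data.frame(tbl)) return(tbl)  # leave non-table elements alone
    keep <- !startsWith(names(tbl), prefix)
    tbl[, keep, drop = FALSE]
  })
}

# Toy feed with one auxiliary column in stop_times
gtfs <- list(
  stop_times = data.frame(
    trip_id = c("t1", "t1"),
    arrival_time = c("08:00:00", "08:05:00"),
    aux_arrival_time_hms = c(28800, 29100)
  )
)

# Only "trip_id" and "arrival_time" remain
names(drop_aux_columns(gtfs)$stop_times)
```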

Another possible solution would be to create an argument to specify fields that should not be exported (e.g. something like no_export = list(stop_times = c("arrival_time_hms", "departure_time_hms"))).
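A rough sketch of what such an argument could do, written as a standalone base R helper. The name drop_no_export() is hypothetical, and the sketch assumes plain data frames; it is not an existing gtfsio function.

```r
# Hypothetical helper (not part of gtfsio): remove the columns listed in
# `no_export`, a named list mapping table names to column names.
# Assumes plain data.frames.
drop_no_export <- function(gtfs, no_export = list()) {
  for (tbl_name in intersect(names(no_export), names(gtfs))) {
    tbl <- gtfs[[tbl_name]]
    keep <- setdiff(names(tbl), no_export[[tbl_name]])
    gtfs[[tbl_name]] <- tbl[, keep, drop = FALSE]
  }
  gtfs
}

# Toy feed with two auxiliary columns in stop_times
gtfs <- list(
  stop_times = data.frame(
    trip_id = "t1",
    arrival_time = "08:00:00",
    arrival_time_hms = 28800,
    departure_time_hms = 28830
  )
)

cleaned <- drop_no_export(
  gtfs,
  no_export = list(stop_times = c("arrival_time_hms", "departure_time_hms"))
)

# Only "trip_id" and "arrival_time" remain
names(cleaned$stop_times)
```

Compared with the prefix convention, this keeps column names untouched but requires the caller to list every auxiliary column explicitly.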

I prefer the naming convention. In my opinion it makes for a simpler way of specifying which columns should be written, both to final users and to developers, but I'd like to hear your thoughts on it, as usual.

Cheers!

rafapereirabr · 2021-03-18T01:54:41Z

rafapereirabr
Mar 18, 2021
Collaborator

Hi all.

@dhersz , thank you again for such an excellent contribution! Here are my quick 2 cents:

1. I personally do not see much use in an additional argument as_dir, but I understand if others would like to include it.
2. I don't know whether compression level could affect other packages/applications such as gtfsrouter, r5r or opentripplanner. Has anyone checked on this before?
3. Regarding additional columns, I would suggest we have a logical argument like standard_cols. If TRUE, the function only exports standard GTFS columns. If FALSE, all files in the data are exported.

3 replies

mpadge Mar 18, 2021
Collaborator

@rafapereirabr totally agree on your 3rd point above - that would be very useful, especially because we could convert any "non-standard" feeds to standard form in 2 simple read-write lines - brilliant!

(And I've no idea about the effects of compression either.)

polettif Mar 19, 2021
Maintainer

Thanks for the writeup @dhersz

I implemented as_dir mostly for debugging purposes so I can skip the zipping/unzipping step if I want to inspect a table, so I'd like to keep it.
I haven't done any benchmarks so far on compression_level. Depending on performance impact we could default to the highest compression or leave it as a param, I don't really mind either way.
standard_cols is indeed a nice solution. Just to specify though: Tables in the . sublist shouldn't be written in any case IMO since I don't really know how this would work in a zip (use a separate subdir?) and it's helpful to have a general temporary location within a feed object.
Re standard_cols: How do we handle non-standard files though? Maybe rename it to standard_only or something like that could help.
I did like the naming convention for additional columns but now I think it's not really helpful. Tidytransit will wrap gtfsio::export_gtfs anyways so it's easy to remove those package-specific columns around the convert_to_standard step @dhersz outlined.

dhersz Mar 19, 2021
Maintainer Author

@rafapereirabr

I don't know whether compression level could affect other packages/applications such as gtfsrouter, r5r or opentripplanner. Has anyone checked on this before?

It probably doesn't affect their behaviour, just how fast they read the feeds. But I haven't tested it.

Regarding additional columns, I would suggest we have a logical argument like standard_cols. If TRUE, the function only exports standard GTFS columns. If FALSE, all files in the data are exported.

Cool, great idea, I'll implement it.

@polettif

I implemented as_dir mostly for debugging purposes so I can skip the zipping/unzipping step if I want to inspect a table, so I'd like to keep it.

No worries, I'll add it then.

I haven't done any benchmarks so far on compression_level. Depending on performance impact we could default to the highest compression or leave it as a param, I don't really mind either way.

Ok. Then I'll add it, and we can expose it in our packages if we see it fitting.

standard_cols is indeed a nice solution. Just to specify though: Tables in the . sublist shouldn't be written in any case IMO since I don't really know how this would work in a zip (use a separate subdir?) and it's helpful to have a general temporary location within a feed object.

I agree that . sublists should never be written.

Re standard_cols: How do we handle non-standard files though? Maybe rename it to standard_only or something like that could help.

Yeah, I also thought about renaming the argument. I'll keep your suggestion.

I did like the naming convention for additional columns but now I think it's not really helpful. Tidytransit will wrap gtfsio::export_gtfs anyways so it's easy to remove those package-specific columns around the convert_to_standard step @dhersz outlined.

Nice! Perhaps this is really better dealt with inside each package.

Thanks for your input guys! I'll make the appropriate changes and get back to you with the results.

dhersz · 2021-03-19T19:36:42Z

dhersz
Mar 19, 2021
Maintainer Author

Changes introduced in last commit:

The standard_only argument is used to write only standard files and fields. It defaults to FALSE (i.e. by default, extra files and fields are written):

library(gtfsio)

gtfs_path <- system.file("extdata/ggl_gtfs.zip", package = "gtfsio")
gtfs <- import_gtfs(gtfs_path)

tmpf <- tempfile(fileext = ".zip")

gtfs$ola <- data.table::data.table(oi = 1:2, ola = 3:4)

export_gtfs(gtfs, tmpf)
zip::zip_list(tmpf)$filename
#>  [1] "calendar_dates.txt"  "fare_attributes.txt" "fare_rules.txt"     
#>  [4] "feed_info.txt"       "frequencies.txt"     "levels.txt"         
#>  [7] "pathways.txt"        "routes.txt"          "shapes.txt"         
#> [10] "stop_times.txt"      "stops.txt"           "transfers.txt"      
#> [13] "translations.txt"    "trips.txt"           "agency.txt"         
#> [16] "attributions.txt"    "calendar.txt"        "ola.txt"

export_gtfs(gtfs, tmpf, standard_only = TRUE)
zip::zip_list(tmpf)$filename
#>  [1] "calendar_dates.txt"  "fare_attributes.txt" "fare_rules.txt"     
#>  [4] "feed_info.txt"       "frequencies.txt"     "levels.txt"         
#>  [7] "pathways.txt"        "routes.txt"          "shapes.txt"         
#> [10] "stop_times.txt"      "stops.txt"           "transfers.txt"      
#> [13] "translations.txt"    "trips.txt"           "agency.txt"         
#> [16] "attributions.txt"    "calendar.txt"

Note that this argument affects both extra files and extra fields in required/optional files:

gtfs$levels
#>    level_id level_index level_name elevation
#> 1:       L0           0     Street         0
#> 2:       L1          -1  Mezzanine        -6
#> 3:       L2          -2 Southbound       -18
#> 4:       L3          -3 Northbound       -24

export_gtfs(gtfs, tmpf, files = "levels", standard_only = TRUE)
import_gtfs(tmpf)
#> $levels
#>    level_id level_index level_name
#> 1:       L0           0     Street
#> 2:       L1          -1  Mezzanine
#> 3:       L2          -2 Southbound
#> 4:       L3          -3 Northbound

An error is thrown if an extra file is specified in files while standard_only is set to TRUE:

export_gtfs(gtfs, tmpf, files = c("levels", "ola"), standard_only = TRUE)
#> Error in export_gtfs(gtfs, tmpf, files = c("levels", "ola"), standard_only = TRUE): Non-standard file specified in 'files', even though 'standard_only' is set to TRUE: 'ola'
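The filtering described above can be pictured with a small base R sketch. The standard_fields list is a tiny hand-written stand-in for the GTFS reference (it only covers levels.txt), and keep_standard() is a hypothetical helper; this is not how export_gtfs() is necessarily implemented internally, and it assumes plain data frames.

```r
# Illustrative stand-in for the GTFS reference: only levels.txt is listed.
standard_fields <- list(
  levels = c("level_id", "level_index", "level_name")
)

# Hypothetical helper (not gtfsio's implementation): keep only standard
# tables and, within them, only standard columns. Assumes plain data.frames.
keep_standard <- function(gtfs, spec = standard_fields) {
  gtfs <- gtfs[intersect(names(gtfs), names(spec))]
  for (tbl_name in names(gtfs)) {
    tbl <- gtfs[[tbl_name]]
    keep <- intersect(names(tbl), spec[[tbl_name]])
    gtfs[[tbl_name]] <- tbl[, keep, drop = FALSE]
  }
  gtfs
}

# Toy feed: one standard table with an extra column, plus one extra table
gtfs <- list(
  levels = data.frame(
    level_id = c("L0", "L1"),
    level_index = c(0, -1),
    level_name = c("Street", "Mezzanine"),
    elevation = c(0, -6)
  ),
  ola = data.frame(oi = 1:2, ola = 3:4)
)

filtered <- keep_standard(gtfs)
names(filtered)         # only "levels" is kept
names(filtered$levels)  # "elevation" is dropped
```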

The compression_level argument controls... the compression level (defaults to the best compression level, 9):

export_gtfs(gtfs, tmpf, compression_level = 1)
file.size(tmpf)
#> [1] 4130

export_gtfs(gtfs, tmpf)
file.size(tmpf)
#> [1] 4080
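Since nobody in the thread has benchmarked compression levels yet, here is one rough way to get a feel for the size/time trade-off using only base R. It uses gzfile(), which exposes a compression level from 1 to 9. Both gzip and the usual .zip compression method rely on DEFLATE, so this is only a proxy, not a measurement of export_gtfs() itself. The stop_times-like table is synthetic and made up for the example.

```r
# Synthetic, made-up data loosely resembling a stop_times table
set.seed(1)
n <- 50000
fake_stop_times <- data.frame(
  trip_id = sample(sprintf("trip_%04d", 1:500), n, replace = TRUE),
  stop_sequence = sample(1:40, n, replace = TRUE),
  arrival_time = sprintf(
    "%02d:%02d:00",
    sample(5:23, n, replace = TRUE),
    sample(0:59, n, replace = TRUE)
  )
)

# Write the table through gzfile() at a given compression level and
# return the resulting file size (bytes) and elapsed time (seconds)
write_at_level <- function(level) {
  out <- tempfile(fileext = ".txt.gz")
  on.exit(unlink(out))
  elapsed <- system.time({
    con <- gzfile(out, open = "w", compression = level)
    write.csv(fake_stop_times, con, row.names = FALSE)
    close(con)
  })[["elapsed"]]
  c(level = level, bytes = file.size(out), seconds = elapsed)
}

# One row per compression level
results <- t(sapply(c(1, 9), write_at_level))
results
```

Running something similar on a few real feeds would give a better idea of whether the default of 9 is worth the extra time.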

The as_dir argument specifies whether the feed should be written as a directory, instead of a .zip file (defaults to FALSE). Note that you can still specify the path with a .zip extension, but the result with as_dir = TRUE will always be a directory:

tmpf <- tempfile(fileext = ".zip")
tmpf
#> [1] "/tmp/Rtmp5NnqB2/file523c1461af29.zip"

export_gtfs(gtfs, tmpf, as_dir = TRUE)
dir.exists(tmpf)
#> [1] TRUE

tmpf <- tempfile()
tmpf
#> [1] "/tmp/Rtmp5NnqB2/file523c38ddd388"

export_gtfs(gtfs, tmpf, as_dir = TRUE)
dir.exists(tmpf)
#> [1] TRUE

The function, thus, doesn't try to guess if you refer to a directory from the path argument, relying solely on the as_dir argument for that. But an error will be thrown if a path without a .zip extension is passed when as_dir = FALSE:

export_gtfs(gtfs, tmpf)
#> Error in export_gtfs(gtfs, tmpf): 'path' must have '.zip' extension. If you meant to create a directory please set 'as_dir' to TRUE.

tmpf <- tempfile(fileext = ".zip")
export_gtfs(gtfs, tmpf)
dir.exists(tmpf)
#> [1] FALSE
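The rules above amount to a small decision based on path and as_dir. A minimal sketch, assuming a hypothetical helper check_export_path() (this is not gtfsio's actual code, and the error message only paraphrases the one shown above):

```r
# Hypothetical helper: validate `path` the way the text above describes
# and report what kind of output would be created.
check_export_path <- function(path, as_dir = FALSE) {
  stopifnot(is.character(path), length(path) == 1L, is.logical(as_dir))
  has_zip_ext <- grepl("\\.zip$", path)
  if (!as_dir && !has_zip_ext) {
    stop("'path' must have '.zip' extension unless 'as_dir' is TRUE.")
  }
  # With as_dir = TRUE any path is accepted, even one ending in ".zip"
  if (as_dir) "directory" else "zip file"
}

check_export_path("feed.zip")                 # "zip file"
check_export_path("feed.zip", as_dir = TRUE)  # "directory"
check_export_path("feed", as_dir = TRUE)      # "directory"
try(check_export_path("feed"))                # error: '.zip' extension required
```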

0 replies
