
[BUG] bind_tweets(): 'Column id doesn't exist.' with empty data_.json #304

Open
4 of 5 tasks
TimBMK opened this issue Mar 10, 2022 · 6 comments

@TimBMK
Contributor

TimBMK commented Mar 10, 2022

Please confirm the following

  • I have searched the existing issues
  • The behaviour of the program deviates from what is described in the documentation.
  • I can reproduce this problem more than once.
  • This is NOT a 3-digit error -- it does not display an error message like "something went wrong. Status code: 400".
  • This is a 3-digit error and I have consulted the Understanding API errors vignette and the suggestions do not help.

Describe the bug

As soon as there is a .json file without an ID ("data_.json") in the data_path of bind_tweets(), the function fails when set to the "tidy" output format. Generating the "raw" format, however, works fine. The following error occurs:

Error in `stop_subscript()`:
! Can't rename columns that don't exist.
x Column `id` doesn't exist.
Backtrace:
  1. academictwitteR::bind_tweets(data_path = "data/2017", output_format = "tidy")
  9. dplyr:::rename.data.frame(., pki = tidyselect::all_of(pkicol))
 10. tidyselect::eval_rename(expr(c(...)), .data)
 11. tidyselect:::rename_impl(...)
 12. tidyselect:::eval_select_impl(...)
 21. tidyselect:::vars_select_eval(...)
 22. tidyselect:::walk_data_tree(expr, data_mask, context_mask, error_call)
 23. tidyselect:::eval_c(expr, data_mask, context_mask)
 24. tidyselect:::reduce_sels(node, data_mask, context_mask, init = init)
 25. tidyselect:::walk_data_tree(new, data_mask, context_mask)
 26. tidyselect:::as_indices_sel_impl(...)
 27. tidyselect:::as_indices_impl(x, vars, call = call, strict = strict)
 28. tidyselect:::chr_as_locations(x, vars, call = call)
 29. vctrs::vec_as_location(x, n = length(vars), names = vars)
 30. vctrs `<fn>`()
 31. vctrs:::stop_subscript_oob(...)
 32. vctrs:::stop_subscript(...)
Run `rlang::last_trace()` to see the full context.

The data_.json is usually an empty file, but it seems to get generated whenever native academictwitteR functions do not return any Twitter data (empty pages). The last three times I used get_user_timeline(), I ended up with these empty files. Deleting the data_.json file fixes the error. Furthermore, I believe the problem only started occurring after I updated academictwitteR to 0.3.1; I don't think it occurred under 0.2.1.

Expected Behavior

I would suggest some sort of failsafe that automatically skips .json files without an ID, as they seem to be empty anyway.
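
A minimal sketch of what such a failsafe could look like (not the package's internal code; it assumes the empty files are exactly those whose names carry no ID, as in the file listing below):

files <- list.files("data/test", pattern = "^data_", full.names = TRUE)
files <- files[grepl("^data_[0-9]+\\.json$", basename(files))]  # skip "data_.json"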

Steps To Reproduce

users <- c("303234771", "2821282972", "84803032", "154096311", "2615232002", "37776042", "2282315483", "405599246", "1060861584938057728", "85161049")

get_user_timeline(x = users,
                  start_tweets = "2017-04-01T00:00:00Z",
                  end_tweets = "2017-06-01T00:00:00Z",
                  bearer_token = bearer_token,
                  n = 3200,
                  data_path = "data/test",
                  bind_tweets = FALSE)

list.files("data/test")
[1] "data_.json"                    "data_848204306566320128.json"  "data_848950153520218113.json"  "users_.json"                   "users_848204306566320128.json"
[6] "users_848950153520218113.json"

data <- bind_tweets(data_path = "data/test", output_format = "tidy")

data_raw <- bind_tweets(data_path = "data/test", output_format = "raw")

Environment

sessionInfo()

R version 4.1.2 (2021-11-01)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19044)

Matrix products: default

locale:
[1] LC_COLLATE=German_Germany.1252  LC_CTYPE=German_Germany.1252    LC_MONETARY=German_Germany.1252 LC_NUMERIC=C                    LC_TIME=German_Germany.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] academictwitteR_0.3.1

loaded via a namespace (and not attached):
 [1] fansi_1.0.2      assertthat_0.2.1 utf8_1.2.2       crayon_1.5.0     dplyr_1.0.8      R6_2.5.1         jsonlite_1.8.0   DBI_1.1.2        lifecycle_1.0.1  magrittr_2.0.2  
[11] pillar_1.7.0     rlang_1.0.1      cli_3.2.0        rstudioapi_0.13  fs_1.5.2         vctrs_0.3.8      generics_0.1.2   ellipsis_0.3.2   tools_4.1.2      glue_1.6.2      
[21] purrr_0.3.4      compiler_4.1.2   pkgconfig_2.0.3  tidyselect_1.1.2 tibble_3.1.6     usethis_2.1.5

Anything else?

Possibly related to #218

@chainsawriot
Collaborator

@TimBMK Thanks for reporting the bug. I can reproduce this.

require(academictwitteR)
#> Loading required package: academictwitteR
users <- c("303234771", "2821282972", "84803032", "154096311", "2615232002", "37776042", "2282315483", "405599246", "1060861584938057728", "85161049")

tempdir <- academictwitteR:::.gen_random_dir()

get_user_timeline(x = users,
                  start_tweets = "2017-04-01T00:00:00Z",
                  end_tweets = "2017-06-01T00:00:00Z",
                  n = 3200,
                  data_path = tempdir,
                  bind_tweets = FALSE,
                  verbose = FALSE)
#> data frame with 0 columns and 0 rows

list.files(tempdir)
#> [1] "data_.json"                    "data_848204306566320128.json" 
#> [3] "data_848950153520218113.json"  "query"                        
#> [5] "users_.json"                   "users_848204306566320128.json"
#> [7] "users_848950153520218113.json"
data <- bind_tweets(data_path = tempdir, output_format = "tidy")
#> Error in `stop_subscript()`:
#> ! Can't rename columns that don't exist.
#> ✖ Column `id` doesn't exist.
data_raw <- bind_tweets(data_path = tempdir, output_format = "raw")

Created on 2022-03-10 by the reprex package (v2.0.1)

There are actually two issues here:

  1. get_user_timeline shouldn't generate those empty json files in the first place.
  2. bind_tweets can't handle those empty json files.

@chainsawriot
Collaborator

@TimBMK I will keep this issue focused on the second issue only, and open another issue for the first one.

@psalmuel19

Hello,

This worked for me:
batch_four <- bind_tweets('data', user = FALSE, verbose = TRUE, output_format = "raw")

but when trying to convert to csv with:
write.csv(batch_four, 'batch_4.csv')

I get this error:
Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, :
arguments imply differing number of rows: 538, 519, 575, 392, 190, 1, 282, 603, 111

@TimBMK
Contributor Author

TimBMK commented Jul 26, 2022

@psalmuel19 this is unrelated to the issue mentioned above, as it is clearly caused by write.csv() rather than the bind_tweets() function. I suspect the nested lists in the raw data format cause the problem. Try unnesting batch_four or use output_format = "tidy" when binding the tweets. If the issue persists, please open a separate issue.

@psalmuel19

@TimBMK
I should have mentioned that I did that and got the error below:
batch_four <- bind_tweets('data', user = FALSE, verbose = TRUE, output_format = "tidy")
Error in `chr_as_locations()`:
! Can't rename columns that don't exist.
✖ Column `id` doesn't exist.

While searching for a solution, I came across the output_format = "raw" option. It worked for binding, but now I can't convert to csv. Any suggestions, please?

@TimBMK
Contributor Author

TimBMK commented Jul 27, 2022

As mentioned in the original post, the easiest fix to get the tidy format to work is to go into the folder with the data and manually delete the empty "data_.json" files. Once they are gone, the error about the non-existent id column no longer comes up.
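
If you have many data folders, a rough sketch for removing the files programmatically instead of by hand (assuming the empty files are always named exactly "data_.json", as throughout this thread):

empty_files <- list.files("data", pattern = "^data_\\.json$",
                          recursive = TRUE, full.names = TRUE)
file.remove(empty_files)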

The raw format does not output a dataframe, but a list of tibbles (a type of dataframe) of different lengths containing different information (this is what the API returns originally). If you are set on using the raw format, you will have to decide what information you want to export to .csv. If you look at the structure of the raw data object (batch_four in your case), it is relatively self-explanatory what you get in each of the tibbles. An easy way to see them is with
names(batch_four)
In order to export the data, you can write out individual tibbles by referencing them explicitly, e.g.
write.csv(batch_four$tweet.main, file = "batch_4.csv")
tweet.main contains the main information of the tweet; additional information (e.g. metrics) needs to be matched in separately. You can use dplyr's left_join() function for this, with the tweet_id as the matching key; see the sketch below. As I mentioned above, however, removing the problematic files by hand will enable the tidy format, which gives you all relevant data in a neat, ready-made format.
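
For example, a sketch along these lines (the element name tweet.public_metrics is an assumption -- check names(batch_four) for the elements your query actually returned):

library(dplyr)
combined <- left_join(batch_four$tweet.main,
                      batch_four$tweet.public_metrics,
                      by = "tweet_id")
write.csv(combined, file = "batch_4.csv", row.names = FALSE)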
