Support labelled vectors #73

Open
jackobailey opened this issue Feb 13, 2021 · 7 comments
@jackobailey

At the moment, readstata13 is the only package available that can write compressed Stata .dta files. But the package does not play nicely with labelled vectors from the tidyverse labelled package. Instead, it treats them as numeric and, thus, removes all value and variable labels when it writes them to disk.

Any chance that readstata13 could support labelled vectors?

@sjewo
Owner

sjewo commented Feb 13, 2021

Could you give a small usage example? I have no experience with the labelled package.

@jackobailey
Author

Of course, see below.

```r
# Load packages
library(tidyverse)
library(labelled)
library(readstata13)
library(here)

# Create data frame with labelled vectors
dta <-
  data.frame(
    lab_vct =
      sample(c(1:3, 999), 100, replace = TRUE) %>%
      labelled(
        labels =
          c(
            "a" = 1,
            "b" = 2,
            "c" = 3,
            "dk" = 999
          )
      )
  )

# Add variable label
var_label(dta$lab_vct) <- "Variable information here"

# Write to disk using save.dta13 (includes no variable/value labels)
save.dta13(data = dta, file = here("file.dta"), compress = TRUE)
```

When we open the file in Stata we see that there are no variable or value labels:

[Two screenshots: the file opened in Stata, with no variable or value labels]

If we save with write_dta from the haven package instead, the labels are kept. Unfortunately, write_dta cannot compress the Stata file, which can yield huge files for large data sets.

[Two screenshots: the write_dta file opened in Stata, with variable and value labels]

Finally, I would add that labelled vectors are often used when creating data for Stata because they let you specify exact label-value combinations (e.g. that "don't know" = 999), whereas factors number everything sequentially. They also let you add variable labels.

@sjewo
Owner

sjewo commented Feb 13, 2021

It might be a good idea to add some code for labelled columns. I want to add some functions for working with labels anyway.

In the meantime you could just prepare your data manually before exporting:

```r
# Load packages
library(tidyverse)
library(labelled)
library(readstata13)
library(here)

# Create data frame with labelled vectors
dta <-
  data.frame(
    lab_vct =
      sample(c(1:3, 999), 100, replace = TRUE) %>%
      labelled(
        labels =
          c(
            "a" = 1,
            "b" = 2,
            "c" = 3,
            "dk" = 999
          )
      )
  )

# Add variable label
var_label(dta$lab_vct) <- "Variable information here"

# Write to disk using save.dta13 (includes no variable/value labels)
save.dta13(data = dta, file = here("file.dta"), compress = TRUE)

# Add another labelled variable
dta$num1 <- 1:100
var_label(dta$num1) <- "Numeric variable"

# Add a variable without label
dta$num2 <- 1:100

# Get vector of labelled variables
labeld_vars <- sapply(dta, is.labelled)

# And convert them to a factor
for (v in names(labeld_vars)[labeld_vars]) {
  dta[[v]] <- to_factor(dta[[v]])
}

# Get variable labels
var_labs <- var_label(dta)

# Replace missing labels with ""
var_labs[sapply(var_labs, is.null)] <- ""

# Unlist and order variable labels
var_labs <- unlist(var_labs)[names(dta)]

# Save variable labels as attribute
attr(dta, "var.labels") <- var_labs

# Save dta-file
save.dta13(data = dta, file = here("file_with_labels.dta"), compress = TRUE)
```
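To check the result without opening Stata, one option is to read the file back in. This is only a sketch: it assumes `read.dta13()`, `varlabel()`, `get.label.name()` and `get.label()` behave as described in the readstata13 documentation, and the tiny data frame and temporary path are made up for illustration.

```r
library(readstata13)

# Hypothetical toy data: one factor and one numeric column
toy <- data.frame(f = factor(c("a", "b", "a")), n = 1:3)

# Variable labels are passed via the "var.labels" attribute, as above
attr(toy, "var.labels") <- c("Factor variable", "Numeric variable")

# Write a compressed file to a temporary location
path <- tempfile(fileext = ".dta")
save.dta13(data = toy, file = path, compress = TRUE)

# Read it back, keeping the numeric codes instead of rebuilding factors
back <- read.dta13(path, convert.factors = FALSE)

# Variable labels stored in the file
varlabel(back)

# Value labels attached to column f (names are labels, values are codes)
get.label(back, get.label.name(back, "f"))
```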

@jackobailey
Author

Thanks. I agree that supporting labelled vectors would be useful. While I could convert the labelled vectors to factors first, it's not useful in this case as it changes the numbers to be sequential. However, we need, for example, "Don't know" to be 999 in all cases and can't do this with factors that number each value label sequentially.
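A small sketch of the renumbering, using a made-up four-element vector and `to_factor()` from the labelled package:

```r
library(labelled)

# Hypothetical example vector with the same labels as above
x <- labelled(
  c(1, 2, 3, 999),
  labels = c("a" = 1, "b" = 2, "c" = 3, "dk" = 999)
)

# The underlying codes of the labelled vector are kept as given
as.numeric(unclass(x))
# 1 2 3 999

# After conversion to a factor, the codes become sequential
f <- to_factor(x)
levels(f)
# "a" "b" "c" "dk"
as.integer(f)
# 1 2 3 4
```

So "dk" ends up stored as 4 instead of 999 once the column is a factor.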

@sjewo
Owner

sjewo commented Feb 13, 2021

I think it is also possible to keep the numeric codes. I’ll take a look into that.

@sjewo
Owner

sjewo commented Feb 13, 2021

I did a rough implementation of labelled vectors in d8af207. The code will probably change in the future, but you might give it a try:

```r
# Install readstata13 from branch labels
remotes::install_github("sjewo/readstata13@label")

library(tidyverse)
library(labelled)
library(readstata13)
library(here)

# Create data frame with labelled vectors
dta <-
  data.frame(
    lab_vct =
      sample(c(1:3, 999), 100, replace = TRUE) %>%
      labelled(
        labels =
          c(
            "a" = 1,
            "b" = 2,
            "c" = 3,
            "dk" = 999
          )
      )
  )

# Add variable label
var_label(dta$lab_vct) <- "Variable information here"

# Get variable labels
var_labs <- var_label(dta)

# Replace missing labels with ""
var_labs[sapply(var_labs, is.null)] <- ""

# Unlist and order variable labels
var_labs <- unlist(var_labs)[names(dta)]

# Save variable labels as attribute
attr(dta, "var.labels") <- var_labs

# Write to disk using save.dta13 (now includes variable and value labels)
save.dta13(data = dta, file = here("file.dta"), compress = TRUE)
```

@sjewo sjewo self-assigned this Feb 13, 2021
@dusadrian

Thanks for the package. It would be great if (missing) values such as 999 (don't know) could be recoded into Stata missing values such as .a.
Ideally, the user could supply the recoding rules: 999 = .a, 998 = .b, etc.
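
As a rough illustration of what user-supplied rules could look like on the R side, here is a sketch built on haven's `tagged_na()`, which haven uses to represent Stata's extended missing values. The helper `recode_to_tagged_na()` and the `rules` vector are hypothetical, and nothing in this thread establishes that readstata13 would write such values as .a or .b.

```r
library(haven)

# Hypothetical helper: replace the codes listed in `rules` with tagged NAs.
# `rules` is a named numeric vector, e.g. c(a = 999, b = 998) for
# 999 -> .a and 998 -> .b. Tagged NAs only exist for double vectors.
recode_to_tagged_na <- function(x, rules) {
  x <- as.double(x)
  out <- x
  for (tag in names(rules)) {
    hit <- !is.na(x) & x == rules[[tag]]
    out[hit] <- tagged_na(tag)
  }
  out
}

rules <- c(a = 999, b = 998)
x <- c(1, 2, 999, 998, 3)
y <- recode_to_tagged_na(x, rules)

# Inspect which tag, if any, each element carries
na_tag(y)
```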
