Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Categorical! to include Char #1282

Closed
Nosferican opened this issue Nov 21, 2017 · 4 comments · Fixed by #1918
Closed

Categorical! to include Char #1282

Nosferican opened this issue Nov 21, 2017 · 4 comments · Fixed by #1918

Comments

@Nosferican
Copy link
Contributor

The current "default" behavior is to make AbstractVector{T} where T <: AbstractString a CategoricalArrays.CategoricalVector and more specifically CategoricalArrays.CategoricalVector{T} where {T <: CategoricalArrays.CategoricalString, ...}

However, I believe that a better default would be to include columns that are eltype(col) <: Char as well.

@nalimilan
Copy link
Member

So, to rephrase, you suggest that eltype(eltype(categorical(["a"]))) give Char rather than String. That kind of makes sense, since that would be consistent with getindex. We would need to find another function to access the type of the wrapped element (but I guess it's mostly used internally, at least for now).

@Nosferican
Copy link
Contributor Author

Currently Categorical Arrays only treats String differently.

df = DataFrame(a = 'a':'z', b = vcat(repeat(["a","b","c","d","e"], inner = 5), "f"))
categorical!(df, :a)
categorical!(df)

That should be fine as for developing
eltype(df[:a]) <: CategoricalValue and
eltype(df[:b]) <: AbstractString.

My point is that categorical!(df) will only pool eltype(col) <: AbstractString. If one has a column with eltype(col) <: Char it won't be pooled by default. I find this strange as Char would most likely be used as a categorical variable. In essence, the current default is

function categorical!(df::DataFrame)
    for i in 1:size(df, 2)
        if eltype(df[i]) <: AbstractString
            df[i] = CategoricalVector(df[i])
        end
    end
    df
end

why not,

function categorical!(df::DataFrame)
    for i in 1:size(df, 2)
        T = eltype(df[i])
        if T <: AbstractString | T <: Char
            df[i] = CategoricalVector(df[i])
        end
    end
    df
end

I am unfamiliar if people actually use Char in dataframes, but as a robust feature I would include it in the default behavior.

@nalimilan
Copy link
Member

Ah, you refer to that issue. It's hard to decide which types should be considered as categorical and which types shouldn't. I'd tend to think we should either convert all types, or only strings by default. Any other combination would be too subtle (what about symbols? dates? custom types). We could add an argument which would default to AbstractString, and which could easily be set to Char or Any depending on your needs.

@cjprybol What do you think?

@cjprybol
Copy link
Contributor

We could add an argument which would default to AbstractString, and which could easily be set to Char or Any depending on your needs.

I think that sounds like a fantastic approach. Given that String vectors benefit the most (are the only vector types that benefit at all?) from being converted to CategoricalVectors, keeping that as the default makes sense. But if any users prefer to convert multiple column types to CategoricalVectors so they can interact with them using a consistent API, and given the great work that went into JuliaData/CategoricalArrays.jl#87 to support non-string values, enabling support on whatever column-types users want makes sense

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants