Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DataEntry should be refactored #520

Closed
GearsAD opened this issue Jul 3, 2020 · 11 comments · Fixed by #550
Closed

DataEntry should be refactored #520

GearsAD opened this issue Jul 3, 2020 · 11 comments · Fixed by #550
Milestone

Comments

@GearsAD
Copy link
Collaborator

GearsAD commented Jul 3, 2020

For 0.9 we should refactor the data entries to include more information.

The proposed data entry structure that every data source (e.g. MongoDB) should implement is:

struct AbstractDataEntry
  label::Symbol # Formerly key 
  id::UUID
  origin::String # E.g. user|robot|session|label
  description::String
  mimeType::String
  hash::String # Probably https://docs.julialang.org/en/v1/stdlib/SHA
  createdTimestamp::DateTime
end  

We can't actually implement this in AbstractDataEntry, but we should enforce that these fields are populated.

JT EDIT: I just swapped label and id for easier deprecation.
JT EDIT: source -> origin from slack comment
JT EDIT: added createdTimestamp

@dehann
Copy link
Member

dehann commented Jul 3, 2020

i would suggest we don’t call this AbstractDataEntry, since Abstract has specific meaning in our general dispatch use. Why not just DataEntry and it can derive from AbstractDataEntry if that abstraction is required?

@dehann
Copy link
Member

dehann commented Jul 3, 2020

rest looks good to me!

@GearsAD GearsAD added this to the v0.9.0 milestone Jul 3, 2020
@GearsAD
Copy link
Collaborator Author

GearsAD commented Jul 3, 2020

Agreed, that was more of a conceptual design.

In practice these fields would have to be populated in GeneralBigDataEntry, MongodbBigDataEntry, and FileBigDataEntry.

@dehann
Copy link
Member

dehann commented Jul 5, 2020

guessing you mean “Data” :-)

@Affie
Copy link
Member

Affie commented Jul 7, 2020

xref #97 (comment)

@Affie
Copy link
Member

Affie commented Jul 9, 2020

What hashing function do we want to use and will it be fixed to only one?
Julia is currently faster on SHA2

@Affie
Copy link
Member

Affie commented Jul 9, 2020

Do we want a timestamp?
The variable has a timestamp that should technically be the same (or close to) the data timestamp.
Perhaps it can help with data integrity.

@GearsAD
Copy link
Collaborator Author

GearsAD commented Jul 13, 2020

SHA2 is good I think?
The timestamp is more of a database-level entry, just to keep track of when the data was actually created, the timestamp in the variable is still the authority on the variable.

@GearsAD
Copy link
Collaborator Author

GearsAD commented Jul 13, 2020

Notes from Slack:

  • If we key the data stores (or ID them) then we can take the store-specific info out of FileDataEntry and the connection information out of MongodbDataEntry. We can keep that info inside the store itself and relate it by the datastorekey.
  • Can we rename source to maybe origin or graphOrigin because source is a very loaded term?
  • Hash is base64 encoded string
  • We'll (optionally) use datasource ID to identify the different stores, which we'll save into the entries
  • Users are responsible for managing the different stores
  • Data is immutable, so we don't have to worry about updateTimestamp or updateData()
  • Line 115 with addData!(dfg::AbstractDFG, dataStore::AbstractDataStore, variableLabel::Symbol, dataLabel::Symbol, blob::Vector{UInt8}, [default parameters like description etc.]) will return the entry when you create the data (really like that way of working with the API as you have it there)
  • SmallData is a special data entry as you've got there, of type Dict{Symbol, Union{Int, Float64, AbstractString, Vector{Int}, Vector{Float64}, Vector{AbstractString}} or something similar
  • SmallData is a special DataEntry, but it's only really used in CGDFG for saving/loading. A user won't see it, the DFGVariable will contain the Dict like above

@dehann
Copy link
Member

dehann commented Jul 13, 2020

I have a suggestion on this:

We'll (optionally) use datasource ID to identify the different stores, which we'll save into the entries

The idea behind immutable id => blob pairs is that the location of the blob does not matter. in fact its highly likely that there will be multiple copies of the blob flying all over the place (hence the immutable requirement), so it seems little awkward to store a single blob store location into the Entry?

perhaps i’m understanding wrong and we should just make that bullet point a bit more explicit.

@dehann
Copy link
Member

dehann commented Jul 19, 2020

targeting v0.9 or v0.10 now that we pushing for nullhypo stuff in v0.9?

@Affie Affie modified the milestones: v0.9.0, v0.10.0 Jul 19, 2020
@Affie Affie linked a pull request Jul 28, 2020 that will close this issue
6 tasks
@Affie Affie closed this as completed Jul 28, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants