Add data validation checks to data analysis #34

Open
5 of 11 tasks
agendazhang opened this issue Nov 28, 2024 · 5 comments

Comments

@agendazhang
Collaborator

agendazhang commented Nov 28, 2024

  • Correct data file format
  • Correct column names
  • No empty observations
  • Missingness not beyond expected threshold
  • Correct data types in each column
  • No duplicate observations
  • No outlier or anomalous values
  • Correct category levels (i.e., no string mismatches or single values)
  • Target/response variable follows expected distribution
  • No anomalous correlations between target/response variable and features/explanatory variables
  • No anomalous correlations between features/explanatory variables
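Several of the checks above can be sketched in plain pandas. This is a minimal illustration only, not the project's actual validation code: the column names (`age`, `group`, `target`), the thresholds, the allowed category levels, and the helper names `check_file_format` and `validate` are all hypothetical placeholders to be replaced with the real dataset's expectations.

```python
from pathlib import Path

import pandas as pd

# Hypothetical expectations; replace with the real dataset's values.
EXPECTED_COLUMNS = ["age", "group", "target"]
NUMERIC_COLUMNS = ["age", "target"]
VALUE_RANGES = {"age": (0, 120)}
ALLOWED_LEVELS = {"group": {"a", "b"}}
MAX_MISSING_FRACTION = 0.05
MAX_ABS_CORRELATION = 0.95


def check_file_format(path: str) -> bool:
    """Correct data file format: here, simply a .csv extension."""
    return Path(path).suffix.lower() == ".csv"


def validate(df: pd.DataFrame) -> list[str]:
    """Return human-readable validation failures (empty list if all pass)."""
    problems = []

    # Correct column names
    if list(df.columns) != EXPECTED_COLUMNS:
        problems.append(f"unexpected columns: {list(df.columns)}")
        return problems  # the remaining checks assume these columns exist

    # No empty observations (rows where every value is missing)
    if df.isna().all(axis=1).any():
        problems.append("found completely empty rows")

    # Missingness not beyond expected threshold
    for col, frac in df.isna().mean().items():
        if frac > MAX_MISSING_FRACTION:
            problems.append(f"too many missing values in {col}: {frac:.0%}")

    # Correct data types in each column
    for col in NUMERIC_COLUMNS:
        if not pd.api.types.is_numeric_dtype(df[col]):
            problems.append(f"{col} is not numeric")

    # No duplicate observations
    if df.duplicated().any():
        problems.append("found duplicate rows")

    # No outlier or anomalous values (simple range check)
    for col, (low, high) in VALUE_RANGES.items():
        values = df[col].dropna()
        if pd.api.types.is_numeric_dtype(values) and (~values.between(low, high)).any():
            problems.append(f"{col} has values outside [{low}, {high}]")

    # Correct category levels
    for col, levels in ALLOWED_LEVELS.items():
        unexpected = set(df[col].dropna().unique()) - levels
        if unexpected:
            found = sorted(map(str, unexpected))
            problems.append(f"unexpected levels in {col}: {found}")

    # No anomalous correlations between numeric columns
    corr = df.select_dtypes(include="number").corr().abs()
    cols = list(corr.columns)
    for i, left in enumerate(cols):
        for right in cols[i + 1:]:
            if corr.loc[left, right] > MAX_ABS_CORRELATION:
                problems.append(f"{left} and {right} are highly correlated")

    return problems
```

Returning a list of messages instead of raising on the first failure lets a notebook report every failed check in one run. The check on the target's distribution is left out because it depends on what distribution the analysis expects.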
@ch3ch20h ch3ch20h assigned ch3ch20h and unassigned ch3ch20h Nov 28, 2024
@ch3ch20h
Collaborator

I just finished the "Correct data file format" check and pushed it to the data_validation branch, but there is a merge conflict. Could anyone take a look?

@agamsanghera
Collaborator

I saw and commented on the PR. It is best to copy the work you did into a new notebook and leave the original analysis.ipynb unchanged. It's good that you did it in a separate branch; this way the original work is safe.

@diwanashita
Collaborator

I will start on the new notebook and @ch3ch20h can continue in that notebook.

@diwanashita
Collaborator

diwanashita commented Nov 29, 2024

@ch3ch20h the work you pushed doesn't meet the rubric given to us. I am fixing this file to include the proper pre-processing and validation we learned from the lecture and readings.
Most of the work there looks copied and pasted from the analysis file, which isn't completely wrong, but it is missing the pre-processing part, which makes it ineffective for the data validation step.

@diwanashita
Collaborator

I created the schema earlier in the day, but when I tried to run the file later tonight, it was not reading in the data for some reason. I will troubleshoot tomorrow morning unless someone knows why this is happening.

4 participants