[data/] Apply the Data Directory Conventions #123

Closed
pishoyg opened this issue Aug 4, 2024 · 7 comments
Labels
backlog What: Low-impact / low-priority dev Why: Developer experience

Comments

@pishoyg
Owner

pishoyg commented Aug 4, 2024

data/raw/ is raw data.

data/input/ is data that we have modified or added.

data/output/ is the data produced by our pipeline. Each output format lives in a subdirectory of this directory.
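A minimal sketch of how this layout could be checked, assuming every data/ directory is expected to contain all three subdirectories (the issue does not say this holds for every project). The function name `missing_subdirs` and the constant `EXPECTED_SUBDIRS` are hypothetical and not part of the repo.

```python
from pathlib import Path

# Subdirectories named by the convention in this issue.
# Assumption: all three are expected under every data/ directory.
EXPECTED_SUBDIRS = ("raw", "input", "output")


def missing_subdirs(data_dir):
    """Return the expected subdirectory names that are absent from data_dir."""
    data_dir = Path(data_dir)
    return [name for name in EXPECTED_SUBDIRS if not (data_dir / name).is_dir()]
```

Calling `missing_subdirs("dictionary/copticsite.com/data")` from the repo root would list whichever of the three names still need to be introduced for that project.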

@pishoyg pishoyg added p4 dev Why: Developer experience labels Aug 4, 2024
pishoyg added a commit that referenced this issue Aug 4, 2024
pishoyg added a commit that referenced this issue Aug 4, 2024
This is the convention throughout the repo.
pishoyg added a commit that referenced this issue Aug 4, 2024
@pishoyg pishoyg added p3 and removed p4 labels Aug 4, 2024
@pishoyg
Owner Author

pishoyg commented Aug 4, 2024

Promoting to p3, since this is something that we want to do; it's not exactly in the backlog.

Note: Regarding output/, only KELLIA remains. The rest already have their outputs in one directory per format.

@pishoyg pishoyg changed the title Apply the Data Directory Conventions [data/] Apply the Data Directory Conventions Aug 4, 2024
@pishoyg pishoyg added P1 and removed p3 labels Aug 4, 2024
@pishoyg
Owner Author

pishoyg commented Aug 5, 2024

TODO: Define data/raw/ and data/input/ directories for the Bible. This is needed for #131.

@pishoyg
Owner Author

pishoyg commented Aug 5, 2024

Remaining TODOs:

  • Introduce data/input/ in copticsite.
  • Introduce data/input/ and data/raw/ in KELLIA.
  • Introduce data/input/ and data/raw/ in Crum.

@pishoyg pishoyg removed the p4 label Aug 8, 2024
@pishoyg pishoyg added this to the Improve the Developer Experience milestone Aug 9, 2024
pishoyg added a commit that referenced this issue Aug 15, 2024
@pishoyg
Owner Author

pishoyg commented Aug 16, 2024

$ find . -type d -name data -not -path './archive/*'
./bible/stshenouda.org/data
./grammar/data
./flashcards/data
./dictionary/copticocc.org/data
./dictionary/marcion.sourceforge.net/data
./dictionary/kellia.uni-goettingen.de/data
./dictionary/copticsite.com/data
./site/data
./keyboard/data
$ ls -A
.DS_Store   .git                     README.md    coptic.egg-info   grammar           setup.py   test
.csslintrc  .gitignore               __pycache__  dictionary        keyboard          site       utils.py
.env        .pre-commit-config.yaml  archive      eslint.config.js  morphology        stats.sh
.env_INFO   Makefile                 bible        flashcards        requirements.txt  stats.tsv

@pishoyg
Owner Author

pishoyg commented Aug 16, 2024

$ ls -d */
__pycache__/  archive/  bible/  coptic.egg-info/  dictionary/  flashcards/  grammar/  keyboard/  morphology/  site/  test/

pishoyg added a commit that referenced this issue Aug 16, 2024
There is absolutely no need to keep this in a separate directory.
pishoyg added a commit that referenced this issue Aug 16, 2024
This essentially reverts 72c791f and
5714157.

We no longer intend to use JSON as the input (as opposed to raw) format.
To make the pipeline simpler, we will use TSV for input.

We also don't intend to edit all languages. We will only edit Bohairic.
pishoyg added a commit that referenced this issue Aug 16, 2024
Dawoud scans now live in a new directory.
@pishoyg
Owner Author

pishoyg commented Aug 16, 2024

Status:

TODO:

  • Enforce no code under data/, and no data outside of data/. This could be checked by file extension: no *.py under data/, and no *.tsv anywhere outside of it.
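One way the extension-based check could look, as a rough sketch rather than what the repo actually does. The names `find_violations`, `CODE_SUFFIXES`, and `DATA_SUFFIXES` are hypothetical, and the sketch assumes a file counts as "under data/" whenever any of its parent directories is named `data`. It does not skip `archive/` or `.git/`; that filtering would need to be added.

```python
from pathlib import Path

# Assumption: these extensions stand in for "code" and "data" respectively.
CODE_SUFFIXES = {".py"}
DATA_SUFFIXES = {".tsv"}


def find_violations(root):
    """Return (relative_path, reason) pairs for files that break the rule."""
    root = Path(root)
    violations = []
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        rel = path.relative_to(root)
        # A file is under data/ if any parent directory is named "data".
        under_data = "data" in rel.parts[:-1]
        if under_data and path.suffix in CODE_SUFFIXES:
            violations.append((rel.as_posix(), "code under data/"))
        elif not under_data and path.suffix in DATA_SUFFIXES:
            violations.append((rel.as_posix(), "data outside data/"))
    return violations
```

Running `find_violations(".")` from the repo root and failing when the result is non-empty would make this usable as a pre-commit hook or a Makefile target.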

@pishoyg pishoyg modified the milestones: Developer Experience, Platform Aug 26, 2024
@pishoyg pishoyg added this to coptic Sep 11, 2024
@pishoyg pishoyg added the backlog What: Low-impact / low-priority label Feb 23, 2025
@pishoyg
Owner Author

pishoyg commented Mar 3, 2025

Status:

Moved:

@pishoyg pishoyg closed this as completed Mar 3, 2025
@github-project-automation github-project-automation bot moved this to Done in coptic Mar 3, 2025