
Documentation PR #151

Closed
wants to merge 21 commits into from

Conversation

zain-sohail
Member

Let's all commit to this branch to create a comprehensive documentation for SED

@rettigl
Member

rettigl commented Oct 6, 2023

This solution looks great; however, in its current form it's difficult for me to test its output, because it is not actually published anywhere, and I cannot easily generate it locally. Also, how do you picture this being run in the future?
I would suggest moving the actual code into a separate .py file and running it from the docs pipeline that you already defined for the main branch (the one that currently updates the dependencies).
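
Such a setup could look roughly like the following, a minimal sketch assuming the generation code is extracted into a hypothetical docs/build_docs.py script and added as steps to the existing docs workflow (file names, paths, and the sphinx-build invocation are illustrative, not taken from the repository):

    # hypothetical steps in the existing docs workflow
    - name: Generate documentation sources
      run: python docs/build_docs.py                    # script extracted from the current inline code; name is illustrative
    - name: Build HTML documentation
      run: sphinx-build -b html docs docs/_build/html   # assumes Sphinx is the docs builder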

@zain-sohail
Member Author

This solution looks great; however, in its current form it's difficult for me to test its output, because it is not actually published anywhere, and I cannot easily generate it locally. Also, how do you picture this being run in the future?

Indeed. My plan is to add this branch to the Read the Docs options so we can check the output. My script is likely not working right now, since the produced files first need to be committed to the branch and then the documentation built. So I still have to figure that out.

I would suggest moving the actual code into a separate .py file,

Yes, I think moving it to another file is best. I am not sure whether a bash script would be the better option, as that wouldn't require any specific environment.

and running it from the docs pipeline that you already defined for the main branch (the one that currently updates the dependencies).

Also a good idea; I will combine the two (that one is not yet committing to the branch either).

@coveralls
Collaborator

coveralls commented Oct 10, 2023

Pull Request Test Coverage Report for Build 6510731696

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 8 unchanged lines in 2 files lost coverage.
  • Overall coverage increased (+9.1%) to 99.666%

Files with Coverage Reduction    New Missed Lines    Coverage
sed/loader/flash/loader.py       1                   97.97%
sed/loader/base/loader.py        7                   88.52%

Totals Coverage Status
Change from base Build 6483984354: 9.1%
Covered Lines: 4771
Relevant Lines: 4787

💛 - Coveralls

@zain-sohail
Member Author

Updating the workflows to use a cache, for much faster access. I still need to set up coveralls with the testing.

Pylint has yet to be done.
Regarding combining the two documentation workflows: it's not possible, since they rely on different criteria, one on the toml file changing and the other on the ipynb files changing (see the sketch below).
The requirements-generating workflow works perfectly now. The bot even commits the output directly to the repo.
Still figuring out the tutorials, since it's not trivial to run the notebooks and nbconvert.
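
For reference, the two triggers differ roughly like this, a sketch assuming standard GitHub Actions path filters (the exact patterns are illustrative, not copied from the workflows):

    # workflow A, dependency/requirements update: triggered when the project definition changes
    on:
      push:
        paths:
          - pyproject.toml

    # workflow B, tutorials/documentation: triggered when notebooks change
    on:
      push:
        paths:
          - '**.ipynb'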

But anyway, all branch outputs can be checked at https://sed.readthedocs.io/en/documentation/ (so far, no changes would be visible).

@rettigl
Member

rettigl commented Oct 12, 2023

Updating the workflows to use a cache, for much faster access. I still need to set up coveralls with the testing.

As far as I understand, this then uses poetry and the lock file, no? I would find it important to always test against the most recent package versions that our dependency limits allow, to see potential problems with updated packages as they come. That's why I don't like having the lock file in the repo: it will essentially always be outdated. Right now, it's not being used by the workflows, so I don't care. If you intend to use it, we at least have to include a poetry update and a poetry lock, and then I doubt it's faster...
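
A sketch of the extra step that would then be needed, so that a lock-file-based workflow still picks up the newest versions pyproject.toml allows (hedged; this is not part of the current workflow):

    - name: Refresh lock file
      run: poetry lock          # re-resolves dependencies to the newest versions the constraints allow
    - name: Install environment
      run: poetry install       # installs exactly what the refreshed lock file pins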

@zain-sohail
Member Author

The cache doesn't have to use it; it can work with any package management tool.
But I think it's a good time to start using poetry, since the package is entirely based on it.
The entire point of the lock file is that users who clone the repo can reproduce the exact environment we have in three commands. I don't see any point in updating the lock file for every minor release.
But even if you want an updated set of packages at all times, then it's all the more important that it is the lock file that gets updated. Otherwise we're testing against a system that neither users nor we developers might ever see.
The speedup will still be there, because at least the Python environment can be cached, and for runs where there's nothing to update in the lock file.
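
A minimal sketch of the kind of caching meant here, assuming actions/cache and poetry's default virtualenv location (paths and the cache key are illustrative):

    - uses: actions/cache@v3
      with:
        path: ~/.cache/pypoetry/virtualenvs                            # poetry's default environment location
        key: venv-${{ runner.os }}-${{ hashFiles('poetry.lock') }}     # reused until the lock file changes
    - name: Install dependencies
      run: poetry install                                              # hits the cache when the key matches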

@rettigl
Member

rettigl commented Oct 12, 2023

Let's test it. I would not say the package is built on poetry; pyproject.toml != Poetry. I don't use poetry by default at all, only if I have to regenerate a poetry.lock file for whatever reason. For me this typically takes more than a minute to resolve dependencies, while installing the environment takes <2 minutes. So I am not so sure about the speedup.
The goal is to have this as a package anyway, and then there will be no poetry and no lock file, just the pyproject and the dependencies. So we need to make sure that at any point in time, starting from those, we get a working installation. That's my point of view and that's what we should test for.

@zain-sohail
Member Author

Let's test it. I would not say the package is built on poetry; pyproject.toml != Poetry. I don't use poetry by default at all, only if I have to regenerate a poetry.lock file for whatever reason. For me this typically takes more than a minute to resolve dependencies, while installing the environment takes <2 minutes. So I am not so sure about the speedup. The goal is to have this as a package anyway, and then there will be no poetry and no lock file, just the pyproject and the dependencies. So we need to make sure that at any point in time, starting from those, we get a working installation. That's my point of view and that's what we should test for.

I initialized sed using poetry. Of course, poetry and setuptools are compatible with each other, but the purpose of poetry is to manage dependencies and also to aid in packaging. If you have any experience building with setuptools, you'll know how much of a hassle it can be; poetry makes it easier. I'd encourage you to adopt it as well.
Since we build using poetry, it only makes sense to test in the same environment too.

Regarding the speedup, you can check here:
https://github.com/OpenCOMPES/sed/actions/runs/6477102441
Of course, here I am not updating the lock file, which we could, but I don't think that's necessary unless there is a change in pyproject.toml.

@rettigl
Member

rettigl commented Oct 12, 2023

Since we build using poetry, it only makes sense to test in the same environment too.

That does not matter. What we want is a package that we can install with pip install sed-processor.
It will only have a list of dependencies and no poetry.lock file, and it will resolve these dependencies and install the most recent package versions they allow at the point in time a user installs the package.
With poetry.lock, on the other hand, you install an environment locked to the package versions defined at the point in time you created the poetry.lock file, which will be outdated immediately afterwards. That's the problem I want to point out.
In order to make sure that the package works for the user who does pip install sed-processor, we need to test with the most recent packages our dependencies allow. And those package versions change independently of our pyproject.toml file!

@zain-sohail
Member Author

That does not matter. What we want is a package that we can install with pip install sed-processor. It will only have a list of dependencies and no poetry.lock file, and it will resolve these dependencies and install the most recent package versions they allow at the point in time a user installs the package. With poetry.lock, on the other hand, you install an environment locked to the package versions defined at the point in time you created the poetry.lock file, which will be outdated immediately afterwards. That's the problem I want to point out. In order to make sure that the package works for the user who does pip install sed-processor, we need to test with the most recent packages our dependencies allow. And those package versions change independently of our pyproject.toml file!

Then wouldn't it make sense to update poetry and run those tests only when we publish a new version to PyPI?
But even if you don't want to use poetry for testing, the cache can still be used for these commands:

        git lfs pull                                      # fetch files tracked with Git LFS
        python -m pip install --upgrade pip               # make sure pip itself is current
        python -m pip install pytest coverage coveralls   # test and coverage tooling
        python -m pip install .                           # install the package itself

In this case, we can use the cache functionality they added to the Python setup action: https://github.com/actions/setup-python#caching-packages-dependencies
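
A sketch of what that could look like in the workflow, assuming the caching option described at that link (version tags are illustrative):

    - uses: actions/setup-python@v4
      with:
        python-version: '3.9'
        cache: 'pip'                # caches pip's package cache between runs
    - name: Install package
      run: |
        python -m pip install --upgrade pip
        python -m pip install .     # subsequent runs reuse the cached wheels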

@rettigl
Member

rettigl commented Oct 12, 2023

This probably makes a lot of sense. I am not at all against using caches, and even less against speedups; the only thing I wanted to point out is this potential caveat of using a static environment for testing. And no, it does not depend on us pushing a new version of the package, but on others publishing new versions of packages we depend on. So even an old version of our package might break if dependencies update. We need to test for that regularly and react if we detect such a problem. That's the burden of maintaining a complex Python project...

@zain-sohail
Member Author

FileNotFoundError: [Errno 2] No such file or directory: 'sed_config.yaml'
https://github.com/OpenCOMPES/sed/actions/runs/6496531345/job/17643665308#step:4:343

I'm getting this error with parallel testing, but only sometimes: 3.9 passed but 3.8 did not. Any ideas?

@rettigl
Member

rettigl commented Oct 12, 2023

A classical race condition. These tests are actually written to be run sequentially, because they all write to the same sed_config.yaml file in the same folder and delete it afterwards. If one test does this before the other, the deletion does not work, and the tests will also randomly fail for other reasons.
We can work around this by specifying a distinct name for the local file in each of these tests, as is already done for at least one of the other tests.
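
As an illustration of that kind of fix (a sketch, not the actual test code; the file writing stands in for the package's config-saving call), each test uses its own distinctly named file so parallel workers cannot collide:

    # sketch: a per-test file name instead of a shared sed_config.yaml
    import os
    import yaml

    def test_save_config_to_file():
        filename = "sed_config_test_save.yaml"   # unique to this test; name is hypothetical
        with open(filename, "w") as f:
            yaml.dump({"key": "value"}, f)       # stands in for the real config-saving call
        with open(filename) as f:
            assert yaml.safe_load(f) == {"key": "value"}
        os.remove(filename)                      # safe to delete: no other test uses this name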

@rettigl
Member

rettigl commented Oct 12, 2023

I'll provide a fix for it

@rettigl
Member

rettigl commented Oct 12, 2023

That should fix the issue you had; however, while doing local testing I encountered a similar problem with the flash loader, where the test files are also deleted. Here, I cannot change the file names, so I don't know how to resolve it.

@rettigl
Member

rettigl commented Oct 12, 2023

The io tests are also predestined for race conditions, as multiple tests access the same files there as well.

@zain-sohail
Member Author

That should fix the issue you had; however, while doing local testing I encountered a similar problem with the flash loader, where the test files are also deleted. Here, I cannot change the file names, so I don't know how to resolve it.

In essence, they don't need to be deleted, at least in a workflow, because the runner deletes everything after it finishes running. For the local case, I can understand your point.

The io tests are also predestined for race conditions, as multiple tests access the same files there as well.

That's true. Do we have many other such tests?

Also, a question: I noticed that our linting workflow also runs tests. Is that there for a purpose?

@rettigl
Member

rettigl commented Oct 12, 2023

In essence, they don't need to be deleted, at least in a workflow, because the runner deletes everything after it finishes running. For the local case, I can understand your point.

They are deleted because otherwise the conversion from h5 to parquet is not tested for the different cases; instead, the parquet files are just read after the first test has run. But this again depends on the order of the tests, which is undefined in a parallel setup, so again -> race condition. It comes from the somewhat different behavior of the flash loader compared to the others.

That's true. Do we have many other such tests?

I don't think so. The i/o tests can also be made safe by changing the file names depending on the test fixture.

Also, a question: I noticed that our linting workflow also runs tests. Is that there for a purpose?

The idea was to have this as the single workflow run at every push that verifies the integrity of the pushed code, which includes linting and tests. Maybe the name is not the best. The other test workflow was added to verify compatibility with different Python versions.

@rettigl
Member

rettigl commented Oct 12, 2023

This should remove all remaining file conflicts/race conditions. I tested with up to 60 workers several times and did not get any failures. ~10 workers seems to be the optimum, with ~50 sec testing time. Of course, that many won't be available on the free runners, I suppose.
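
For reference, a sketch of how such parallel runs can be invoked, assuming pytest-xdist provides the worker option:

    python -m pip install pytest-xdist   # provides the -n flag for parallel workers
    pytest -n 10 tests/                  # ~10 workers was the local sweet spot here
    pytest -n auto tests/                # or let xdist match the number of available cores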

@zain-sohail
Member Author

This should remove all remaining file conflicts/race conditions. I tested with up to 60 workers several times and did not get any failures. ~10 workers seems to be the optimum, with ~50 sec testing time. Of course, that many won't be available on the free runners, I suppose.

The usual is 2 cores for Linux systems, it seems. Though GitHub has a feature to host your own runners:
https://github.com/organizations/OpenCOMPES/settings/actions/runners/new
Setup seems quite simple. If there's a server available, we could use it. But maybe it's too much hassle.
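
If such a runner were registered, pointing a workflow at it would roughly mean changing the runs-on key (a sketch with illustrative labels):

    jobs:
      test:
        runs-on: self-hosted                   # instead of ubuntu-latest; uses the registered machine
        # runs-on: [self-hosted, linux, x64]   # labels can narrow down which runner is picked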

@rettigl
Member

rettigl commented Oct 12, 2023

This should remove all remaining file conflicts/race conditions. I tested with up to 60 workers several times and did not get any failures. ~10 workers seems to be the optimum, with ~50 sec testing time. Of course, that many won't be available on the free runners, I suppose.

The usual is 2 cores for Linux systems, it seems. Though GitHub has a feature to host your own runners: https://github.com/organizations/OpenCOMPES/settings/actions/runners/new Setup seems quite simple. If there's a server available, we could use it. But maybe it's too much hassle.

Honestly, it's not that important how long the online runs take if you run the tests locally beforehand, so I would not bother. Speaking of which, it looks like I forgot to check linting...

@github-advanced-security

This pull request sets up GitHub code scanning for this repository. Once the scans have completed and the checks have passed, the analysis results for this pull request branch will appear on this overview. Once you merge this pull request, the 'Security' tab will show more code scanning analysis results (for example, for the default branch). Depending on your configuration and choice of analysis tool, future pull requests will be annotated with code scanning analysis results. For more information about GitHub code scanning, check out the documentation.

@github-advanced-security bot left a comment

pylint found more than 10 potential problems in the proposed changes. Check the Files changed tab for more details.

@zain-sohail
Member Author

The PR has deviated too much from its original intention. Hence I will close it and open two new branches: one for CI/workflows and another for documentation.
