Improve performance of data collection #2056

Open
vvmruder opened this issue Sep 23, 2024 · 3 comments · May be fixed by #2059
Labels
discussion usergroup work This lable is used to mark issues which will be done by usergroup members


@vvmruder
Collaborator

vvmruder commented Sep 23, 2024

Intro

Coming back to #1544, we can see that many redundant steps are executed on every request. This was mainly introduced to have a configurable server which can be altered at runtime, meaning the data integrator can change content in the database without restarting the server. It turned out that this use case is rare; probably it is not used at all. Instead, new data is provided by a regular deployment, which means the underlying data and the server are set up again from scratch.

Initialising things once

One of the most interesting points, which was not touched in the recent refactorings and performance improvements, is the initialisation of the processor. It is initialised every time an ÖREB-related endpoint is called. If we agree on the statement made in the intro, one of the most effective performance gains would be to refactor pyramid_oereb so that the processor is initialised only once at boot time. This would remove from every request the whole initialisation process which is done in this method.

=> This has to be discussed, as it is an organisational decision to make pyramid_oereb recognise the configuration and data sources only at boot time and not on every request.
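
To make the idea concrete, here is a minimal sketch of the "initialise once" pattern. It is not pyramid_oereb's actual code: `build_processor`, `get_processor`, `handle_request`, and the config file name are hypothetical, and the dictionary only stands in for the real processor.

```python
import functools


def build_processor(config_path):
    """Hypothetical stand-in for the expensive processor setup."""
    return {"config": config_path, "readers": ["real_estate", "plr"]}


@functools.lru_cache(maxsize=1)
def get_processor(config_path):
    """Build the processor on first use, then return the cached instance."""
    return build_processor(config_path)


def handle_request(config_path, egrid):
    """Hypothetical request handler that reuses the cached processor."""
    processor = get_processor(config_path)
    return f"extract for {egrid} using {processor['config']}"


if __name__ == "__main__":
    get_processor("pyramid_oereb.yml")  # warm up once at boot time
    for egrid in ("CH1", "CH2", "CH3"):
        handle_request("pyramid_oereb.yml", egrid)
    # Number of times the expensive setup actually ran
    print(get_processor.cache_info().misses)
```

The trade-off is the one named above: with this pattern a changed configuration is only picked up after a restart.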

Parallelisation

I see one potential place where we could hook in to take proper advantage of parallel processing:
https://github.com/openoereb/pyramid_oereb/blob/master/pyramid_oereb/core/readers/extract.py#L51-L104

Here all iterative querying of the sources is bundled, and here we could take action. However, we should discuss which technique we want to use and whether this should be configurable.

asyncio

Since this project builds on recent Python versions, we are able to use asyncio in combination with SQLAlchemy. This is probably the best solution in terms of a future-proof setup. However, it comes with some downsides. Asyncio is not 100% supported across the Python stack and the libs we may depend on. So a major task would be to research where we might be blocked from using it.
The biggest upside of this solution is its scalability and its economical use of resources.
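
As a rough illustration of the asyncio variant, the sketch below queries several sources concurrently and caps the number of queries in flight with a semaphore. `query_source`, `collect_async`, the theme names, and the delays are hypothetical; a real implementation would await an async database driver instead of `asyncio.sleep`.

```python
import asyncio


async def query_source(name, delay):
    """Hypothetical stand-in for one asynchronous source query."""
    await asyncio.sleep(delay)  # simulates waiting on the database
    return f"{name}: done"


async def collect_async(sources, max_concurrent=4):
    """Query all sources concurrently, at most max_concurrent at a time."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def limited(name, delay):
        async with semaphore:
            return await query_source(name, delay)

    # gather returns results in the order the awaitables were passed in
    return await asyncio.gather(*(limited(n, d) for n, d in sources))


if __name__ == "__main__":
    sources = [("forest", 0.02), ("noise", 0.01), ("land_use", 0.03)]
    print(asyncio.run(collect_async(sources)))
```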

multiprocessing / threading

A well-known way of implementing iterative parallel tasks. We could easily implement that. The main disadvantage here is the forking. Threads in one solution, or processes in the other, may put much more load onto the bare-metal server in the end. So we should discuss how we can avoid brute-forcing either the database or our servers. In my opinion we could avoid that with some additional configuration where one can set the number of threads or processes allowed.
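
A minimal sketch of the threaded variant with a configurable limit could look like this. `read_source` and `collect_threaded` are hypothetical names, and `max_workers` plays the role of the configuration value suggested above.

```python
from concurrent.futures import ThreadPoolExecutor
import time


def read_source(theme):
    """Hypothetical stand-in for one blocking source query."""
    time.sleep(0.01)  # simulates database latency
    return theme, f"records for {theme}"


def collect_threaded(themes, max_workers=4):
    """Read all themes with a bounded pool; max_workers caps the load."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map yields results in input order, regardless of completion order
        return dict(pool.map(read_source, themes))


if __name__ == "__main__":
    print(collect_threaded(["forest", "noise", "land_use"], max_workers=2))
```

Swapping in `ProcessPoolExecutor` from the same module gives the process-based variant with the same interface, provided the worker function and its arguments can be pickled.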

SQLAlchemy Session Management

A thing we also need to research is the way we currently implement our session sharing:
https://github.com/openoereb/pyramid_oereb/blob/master/pyramid_oereb/core/adapter.py#L12-L73

It is a home-made way to keep long-running servers from accumulating too many open DB sessions. Currently I am not aware of the influence that would have in a parallel context: not for threading, nor for multiprocessing, nor for asyncio.
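
To show what would have to hold in a threaded context, here is a sketch under stated assumptions. It does not reproduce the adapter linked above: `SessionRegistry` and its factories are hypothetical. It shares one engine-like object per connection string behind a lock and keeps session-like objects per thread. SQLAlchemy's `scoped_session` provides a comparable thread-local registry.

```python
import threading


class SessionRegistry:
    """Hypothetical registry: one shared engine-like object per connection
    string, one session-like object per thread."""

    def __init__(self, engine_factory, session_factory):
        self._engine_factory = engine_factory
        self._session_factory = session_factory
        self._engines = {}
        self._lock = threading.Lock()
        self._local = threading.local()

    def get_engine(self, connection_string):
        # The lock prevents two threads from creating the same engine twice.
        with self._lock:
            if connection_string not in self._engines:
                engine = self._engine_factory(connection_string)
                self._engines[connection_string] = engine
            return self._engines[connection_string]

    def get_session(self, connection_string):
        # Sessions live in thread-local storage and are never shared.
        sessions = getattr(self._local, "sessions", None)
        if sessions is None:
            sessions = self._local.sessions = {}
        if connection_string not in sessions:
            engine = self.get_engine(connection_string)
            sessions[connection_string] = self._session_factory(engine)
        return sessions[connection_string]
```

Thread-local storage does not carry over to asyncio tasks, which all run in one thread, and forked processes would need their own engines, so each of the three techniques needs a separate look.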

@vvmruder vvmruder added usergroup work This lable is used to mark issues which will be done by usergroup members discussion labels Sep 23, 2024
@voisardf
Collaborator

@vvmruder Thanks for the work, we will study and discuss the point in the PSC
@michmuel @svamaa

@voisardf
Collaborator

voisardf commented Oct 4, 2024

@vvmruder After discussion in the PSC, could you provide a time estimate for the changes necessary to realise the 'Initialising things once' part above?
On our side, we will check with the user group whether everybody uses only the standard and interlis source configurations.

@michmuel
Collaborator

@vvmruder, @svamaa, @voisardf
We had some more discussion in the PSC concerning the task "initialising things once". It is important for us that routine operations such as updating data of particular themes or updating real estate information can be performed without a server restart. However, changes in configuration such as the change of the data source of a topic (database/database schema) may require a server restart.

@vvmruder vvmruder linked a pull request Oct 28, 2024 that will close this issue
1 task