Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor distributed query #558

Merged
merged 12 commits into from
Dec 12, 2024
Merged

Conversation

zhuoyuan-liu
Copy link
Contributor

@zhuoyuan-liu zhuoyuan-liu commented Nov 5, 2024

@javuto , This PR is not completed but shows the idea of how we move all the calculations from each node's distributed query request to the distributed query creation.

In this idea, we would create a new table that records which node should execute which query. In this case, we only iteration all nodes when we create the distributed query. When osquery sends the distributed query result, we only check this table and we don't need to go through the whole list of distributed queries.

I also preferred a small change:

  • Use a single status column instead of several ones. It would be simplicity and readability. Also, it would be efficient when querying the status, we don't have to write a query for each column. A classic example would be "active = ? AND completed = ? AND deleted = ? AND expired = ? AND type = ? AND environment_id = ?",

In this case, we can get rid of several tables, we only need one table to track the status of all distributed queries.

TODO after this merged:

  • Update the logic of completing a distributed query
  • Update the logic shown expected node and completed node
  • Remove several tables that are no longer needed

queries/queries.go Outdated Show resolved Hide resolved
@javuto javuto added refactor Refactorization of code queries On-demand queries related issues labels Nov 5, 2024
@javuto javuto changed the title Refactor distributed query WIP: Refactor distributed query Nov 7, 2024
@zhuoyuan-liu zhuoyuan-liu marked this pull request as draft November 13, 2024 15:50
@zhuoyuan-liu
Copy link
Contributor Author

zhuoyuan-liu commented Nov 14, 2024

@javuto Could you please take a look and give some feedback? It was tested in our dev cluster. All the core functions work well. However, there is a bug when using multiple tags: Currently, it uses union instead of intersection. i.e. If we use two tags: {env:dev, platform:ubuntu}, in the past, it should select nodes that have both tags, but now it will select all nodes that have either one of the tags. It will be fixed when we refactor the old logic and in the future, it will supports more query based on different tags. e.g. all nodes in dev environment but not running on the windows platform.

I tried to make this PR as small as possible and only replaced the way of creating new queries.

Here are the follow-up tasks after this PR:
Update the logic of completing a distributed query
Update the logic showing expected node and completed node
Remove several tables that are no longer needed

@zhuoyuan-liu zhuoyuan-liu marked this pull request as ready for review November 14, 2024 12:48
@zhuoyuan-liu zhuoyuan-liu changed the title WIP: Refactor distributed query Refactor distributed query Nov 14, 2024
queries/queries.go Outdated Show resolved Hide resolved
@javuto
Copy link
Collaborator

javuto commented Dec 9, 2024

Let's merge this and continue the refactor, nice job! 👏 🫡

@javuto javuto merged commit e8b9832 into jmpsec:main Dec 12, 2024
24 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
queries On-demand queries related issues refactor Refactorization of code
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants