Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor connector #319

Merged
merged 24 commits into from
Oct 5, 2023
Merged

Refactor connector #319

merged 24 commits into from
Oct 5, 2023

Conversation

jvillafanez
Copy link
Member

@jvillafanez jvillafanez commented Mar 22, 2023

New things coming in this PR:

  • Possible usage of multiple connectors when indexing data. This is intended to help with data migration from one index to another without downtime for the users
  • New connectors:
    • "Legacy" -> default one, matching the current behavior
    • "RelevanceV2" -> returns results based on the relevance of the document (to be detailed below)
  • Old commands are expected to affect to all the "write" connectors configured.
  • New "fillSecondary" command to fill the secondary index (the one we want to migrate to)

About the new "RelevanceV2" connector:

  • It will return results based on the score provided by elasticsearch:
    • Matching the whole word in the filename will score the whole points.
    • Matching the starts of the word in the filename will score half of the points (No longer applies)
    • Matching a part of the word in the filename will score a half of the points
    • If we're searching in the file content, we only match the whole word for the whole points.
  • The modification time will affect the end score of the document the following way.
    • Last modification older than 1 year -> score * 0.25
    • Last modification older than 1 month -> score * 0.5
    • Last modification older than 1 week -> score
    • Last modification older than 1 day -> score * 2
    • Modified in the last 24 hours -> score * 4

Note that, while the modification time affects the scoring, it doesn't mean that recent files will always be the first ones to appear. Old files might still have a higher score even after those boosts.

The "RelevanceV2" connector also introduces new ways to search for files based on the indexed fields. Note that the following info only applies to the "RelevanceV2" connector.

By default, the "RelevanceV2" connector will search in the name field, and in the file content if possible. Old limitations for the app to index the file content are still in place. Also note that, due to those limitations, big files might not get its contents indexed.

Additional searches you can do with the "RelevanceV2" connector:

  • Search by extension -> ext:pdf , ext:docx , ext:gif , ext:mp4 , ext:tar.gz , ext:gz , etc, any extension is possible
  • Search by size, only in bytes or megabytes:
    • Search by byte size -> size.b:<8092 , size.b:>102400, size.b:[8092 TO 16184]
    • Search by megabyte size -> size.mb:<3, size.mb:>9, size.mb:[3 TO 9]
  • Search by type: only "file" or "folder" -> type:file , type:folder
  • Search by date:
    • Search by timestamp -> mtime:<1678960862 , mtime:>1678960862 , mtime:[1608111372 TO 1678960862]
    • Search by date -> mtime:<2021-08-25 , mtime:>2023-01-18, mtime:[2022-01-01 TO 2022-12-31]
  • Search by mimetype -> mime:image , mime:gif , mime:text
    NOTE: To search for the whole mimetype such as "image/gif" use mime.key:image\/gif

By default, each search term will be joined with an "OR" operator. For example brown ext:pdf will be interpreted as "name or content containing brown OR extension = pdf", so "brown.txt" file and "tito.pdf" will appear in the results. You can use brown AND ext:pdf to match pdf files containing brown in the name or contents. (It doesn't works like this any longer)
Each search term will narrow the search. For example brown ext:pdf will be interpreted as "name or content containing brown AND extension = pdf", so "brown.pdf" and "a brown paper.pdf" will appear, but not "brown.txt" or "tito.pdf"

Some example of complex searches:

  • Files containing "confidential" updated since 2023 whose size is less than 10MB: confidential mtime:>2023-01-01 size.mb:<10
  • Folders containing more than 1GB: type:folder size.mb:>1024
  • Images between March and June 2020: mime:image mtime:[2020-03-01 TO 2020-06-30]
  • pdf or txt files containing "oxygen" or "helium": (oxygen OR helium) AND (ext:pdf OR ext:txt) (no lnoger applies)

Note that matching by name is pretty lax, so expect a bunch of unexpected results. Anyway, good results are expected to be on top.


Migrating to the "RelevanceV2" connector:

If you haven't indexed anything yet, you're encouraged to setup the connectors you want to use as part of the app configuration. The recommended one is "RelevanceV2" for write and search.

If you have indexed data, these are the steps to migrate to the new index.

  1. Let's assume you have the "Legacy" connector setup for write and search.
  2. Add the "RelevanceV2" connector to the list of write connectors. The list should have both "Legacy" and "RelevanceV2"
  3. Run the occ search:index:fillSecondary RelevanceV2 <user> command. The command needs to be run for all the users (or at least the ones using the search feature), and it's expected to take a lot of time.
  4. Once the indexed data have been migrated for all the users, you can switch the search connector to use the "RelevanceV2" one. From this point you can use the new features coming from the new connector.
  5. After checking everything is good, you can remove the old "Legacy" connector from the list of write connectors.
  6. Then you can completely remove the index from elasticsearch.

With step 2 you'll be writing in both indexes at the same time. This is expected to be slower.
Note that step 2 just takes care of new files. Files indexed previously won't be present in the new index. This is why step 3 is there.
Step 4 is important and you should stop at that point for a while. If something goes wrong, you can still revert things, in particular, you can switch back to the "Legacy" connector.
From step 5 the actions are irreversible. If you want to go back, you'll have to start a new migration.

It's important to notice there isn't any expected downtime while the migration happens.
Until step 4, the "Legacy" connector will keep updating the index normally. When the switch happens in the search connector, the new "RelevanceV2" connector will access to the new index, which should have been fully updated.


NOTE: This might not be the final version. This is mostly the state of the PR and things might change without notice. The official documentation is expected to contain the final information once this PR is completely finished.
@mmattel FYI this will need documentation.

@jvillafanez jvillafanez self-assigned this Mar 22, 2023
@mmattel
Copy link
Contributor

mmattel commented Mar 22, 2023

@jvillafanez great 👍 pls ping me again when close to merge so I can prepare the docs part.

@jvillafanez jvillafanez force-pushed the refactor_connector branch 2 times, most recently from 1762201 to 5096d00 Compare March 23, 2023 12:27
@sonarqubecloud
Copy link

sonarqubecloud bot commented Apr 4, 2023

Kudos, SonarCloud Quality Gate passed!    Quality Gate passed

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 134 Code Smells

0.0% 0.0% Coverage
5.5% 5.5% Duplication

@jvillafanez jvillafanez changed the title [WIP] Refactor connector Refactor connector Apr 4, 2023
@JRundfeldt
Copy link

these features seem to be too good to not merge them... @hodyroff I was just cleaning up and updating projects, when I found this change, which we apparently never merged. I do think our customers would benefit - maybe this is something for the 10.13?

@mmattel
Copy link
Contributor

mmattel commented May 17, 2023

@jvillafanez occ search:index:fillSecondary RelevanceV2 <user> with this PR, are there also core change(s) needed?

@jvillafanez
Copy link
Member Author

No core changes are needed. Everything is part of the app.

@mmattel
Copy link
Contributor

mmattel commented May 17, 2023

No core changes are needed. Everything is part of the app.

In this case, just to be noted, an app upgrade could be made to be regulary downloaded from the marketplace making the change availabe asap plus adding it to the default for the 10.13 release. There is no doc restriction going that path.

@cdamken cdamken requested review from IljaN, pako81 and jnweiger July 4, 2023 16:28
@cdamken
Copy link

cdamken commented Jul 4, 2023

@jnweiger could we do some QA here?

@jnweiger jnweiger added the QA:todo This issue is ready for QA label Jul 5, 2023
@jnweiger jnweiger mentioned this pull request Jul 5, 2023
46 tasks
@jnweiger jnweiger requested a review from mrow4a July 6, 2023 08:12
nirajacharya2 and others added 2 commits July 7, 2023 16:41
@jnweiger
Copy link
Contributor

could we do some QA here?

@jnweiger could we do some QA here?

Please review and merge into release-2.4.0 branch, I''ll build a 2.4.0-rc.1 from there and then start QA.

@jvillafanez
Copy link
Member Author

I don't see a release-2.4.0 branch, so I guess it will be created after this PR is merged.

@jnweiger jnweiger changed the base branch from master to release-2.4.0 July 13, 2023 07:58
@jnweiger
Copy link
Contributor

It wasn't pushed. Sorry. Pushed now, and retargetted this PR.

@jnweiger
Copy link
Contributor

Copy link
Contributor

@jnweiger jnweiger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've opened some issues for dubious search results #323 (comment)

I'd suggest to rename connectors_search to connector_search

I'd suggest to make all new keywords upper case. SIZE, EXT, MTIME, TYPE, MIME

size.b and size.mb is supported. size.kb is missing.

Migration step 6 is unclear: "Then you can completely remove the index from elasticsearch."
-> search:index:reset ? or some 'drop table' SQL in the elastic server?

@jnweiger jnweiger mentioned this pull request Jul 25, 2023
11 tasks
@jnweiger
Copy link
Contributor

jnweiger commented Jul 25, 2023

Screenshot for @mmattel Admin -> Settings -> Search
grafik

@jvillafanez
Copy link
Member Author

Latest commit has changed the behavior of the search. The initial post has been updated to reflect those changes.

@jvillafanez
Copy link
Member Author

I'd suggest to make all new keywords upper case. SIZE, EXT, MTIME, TYPE, MIME

This will need some voting... I'd rather keep them lowercase 😄

size.b and size.mb is supported. size.kb is missing.

We'd need to send that info because as far as I know, elasticsearch doesn't make any calculations (or it's complex to setup). As far as elasticsearch is concerned, you can send size.b = 50 and size.mb = 70 and those values will be stored so that's a potential problem we have. That's why I'd rather store the minimum information possible. Furthermore, the size.gb is also missing for the same reason.

Migration step 6 is unclear: "Then you can completely remove the index from elasticsearch."
-> search:index:reset ? or some 'drop table' SQL in the elastic server?

It's mainly for elasticsearch maintenance, to remove data that won't be used any longer. It's possible to skip that step without any problem on our side, but you have to live with junk data in elasticsearch.

@jnweiger
Copy link
Contributor

Migration step 6 is unclear: "Then you can completely remove the index from elasticsearch."
-> search:index:reset ? or some 'drop table' SQL in the elastic server?

It's mainly for elasticsearch maintenance, to remove data that won't be used any longer. It's possible to skip that step without any problem on our side, but you have to live with junk data in elasticsearch.

Understood. Instructions, how to exactly remove this junk data is missing.

@jnweiger
Copy link
Contributor

Most of the problems are fixed in the current state of #319 but I believe there is one regression now
#331 (comment)

@sonarqubecloud
Copy link

Kudos, SonarCloud Quality Gate passed!    Quality Gate passed

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 19 Code Smells

0.0% 0.0% Coverage
5.5% 5.5% Duplication

@jnweiger jnweiger merged commit 4c472a4 into release-2.4.0 Oct 5, 2023
@delete-merged-branch delete-merged-branch bot deleted the refactor_connector branch October 5, 2023 09:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
QA:todo This issue is ready for QA
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants