-
-
Notifications
You must be signed in to change notification settings - Fork 3.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[4.0] Make the tuple length of com_finder configurable #20384
Conversation
I'd be opposed to this change as it is removing a feature that is required on some sites. A better solution would be to make the number of words per tuple a configurable parameter of the indexer (default to 1 for new installations). |
Making this configurable would mean yet another option and quite a bit more code for something that 98% out there don't need. At the same time I think that smart search is not and can not be the ultimate site search engine, simply because we are writing a CMS and not a search engine. This component is a feature for all those sites out there that need a slightly better search than com_search, but if you need something really sophisticated, use a pro tool like Elastic Search. So I'm advocating bit less complexity here. Right now this component is so complex that no one really worked out the issues in the last 5 years. Don't force this to stay abandonware. |
Just to list all current PRs for Finder/Smart Search:
|
I'm here with @chrisdavenport. This is a feature I like. Perhaps we can think of a new option to enable/disable tuples. But I would not remove the functionality. I'm sure with some use of the |
7a5ab32
to
c26f898
Compare
I've now made this configurable. |
…4finder_termsphrases
@@ -168,33 +174,28 @@ public static function tokenize($input, $lang, $phrase = false) | |||
$tokens[] = new FinderIndexerToken($terms[$i], $lang); | |||
} | |||
|
|||
// Create two and three word phrase tokens from the individual words. | |||
for ($i = 0, $n = count($tokens); $i < $n; $i++) | |||
// Create multi-word phrases tokens from the individual words. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
phrase not phrases
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
@@ -40,6 +40,7 @@ COM_FINDER_CONFIG_STEMMER_FR="French Only" | |||
COM_FINDER_CONFIG_STEMMER_LABEL="Select Language Stemmer" | |||
COM_FINDER_CONFIG_STEMMER_PORTER_EN="English Only" | |||
COM_FINDER_CONFIG_STEMMER_SNOWBALL="Snowball" | |||
COM_FINDER_CONFIG_TUPLECOUNT_LABEL="Length indexed tuples" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Really hard for me to say i this label is correct or not as I dont understand it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And I have no idea how to phrase this short and elegantly. This is the setting how Joomla should index the text. If you set this to 1, then each word is just indexed separately. If you set it to something > 1, it also indexes every tuple of length 2 to $tuplecount.
Practical example: Text is
Joomla is a good CMS for everything
Then Joomla would index the following tuples for a setting of 3:
- Joomla
- is
- a
- good
- CMS
- for
- everything
- Joomla is
- is a
- a good
- good CMS
- CMS for
- for everything
- Joomla is a
- is a good
- a good CMS
- good CMS for
- CMS for everything
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Simply label it with
Tuple Length
or
Maximum Tuple Length
??
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
approve language strings
@@ -40,6 +40,7 @@ COM_FINDER_CONFIG_STEMMER_FR="French Only" | |||
COM_FINDER_CONFIG_STEMMER_LABEL="Select Language Stemmer" | |||
COM_FINDER_CONFIG_STEMMER_PORTER_EN="English Only" | |||
COM_FINDER_CONFIG_STEMMER_SNOWBALL="Snowball" | |||
COM_FINDER_CONFIG_TUPLECOUNT_LABEL="Tuple Length" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if "Words Per Phrase" might be a little clearer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If that's correct then it's better. Although to be honest all the terms here are pretty meaningless. Snowball? Stemmer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll change the options to include an "advanced" switch that hides those options...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea
FYI I looked all over for a wording in French for |
I changed the wording. I would put the change with the advanced option for the configuration in another PR, since there are a few changes in the other PRs for com_finder. |
WITHOUT THE PR (Index perform) WITH THE PR (Index perform) WITHOUT THE PR (Search perform) WITH THE PR (Search perform) IMO Sacrifice the ability to search for a specific phrase, if with this we obtain a better performance, worth it. It may be that after we have all these PRs inside, we can find a way to return this ability, or something similar. |
Also, I notice, while I try search, the focus of the input gone, and I needed to click on the input again to continue writing. Some JS wrong? need check the console. Anyway, this is not related to this PR |
I have tested this item ✅ successfully on 0e20307 This comment was created with the J!Tracker Application at issues.joomla.org/tracker/joomla-cms/20384. |
I added separate translation strings for each search feature and an intro and outro string. |
will test asap |
@carlitorweb @infograf768 would you have a few minutes to test this again? |
Installed the sample demo, searched
|
@Quy EDIT: One should be able to also obtain the right results with |
I have fatal error here before and after patch Looking at the file I see that |
After correcting this locally, all works fine. |
PR for the Fatal Error : #21016 |
@Quy This is an interesting error. I just tested this and this is also present in 3.x. Can we merge/test this one without fixing that bug here and now? I will provide another PR to fix this soon. Just want this one merged in... |
@infograf768 thanks for your PR. 😄 |
How may using a wrong format be a code error? |
using a "wrong" format should indeed not be a big issue. In worst case you get no result at all. But it should NEVER throw an error like it does right now. @Quy might not get the result that he is looking for, but he definitely should not get the warnings that he got right now. That is an issue that we also have to fix in 3.x and you could consider that a (low-level) security issue. It is an information disclosure. |
|
I have tested this item ✅ successfully on 5f21dc9 This comment was created with the J!Tracker Application at issues.joomla.org/tracker/joomla-cms/20384. |
@carlitorweb can you please retest? |
I have tested this item ✅ successfully on 5f21dc9 This comment was created with the J!Tracker Application at issues.joomla.org/tracker/joomla-cms/20384. |
Ready to Commit after two successful tests. |
Thank you! |
Thankyou! |
[UPDATE] After the initial discussion, I changed this to make the tuple length of com_finder be configurable. This should make this less controversial.
Original text for the old PR
Finder is indexing not only every word, but also every combination of 2- and 3-word-tuples in the text to be indexed. This results in gigantic storage consumption by the index and a performance decrease when indexing of about 20%. At the same time, it only marginally improves the search results. When removing this feature, we are sacrificing the possibility to search for specific phrases of up to 3 terms (like "Joomla is great") for a reduction in necessary storage space of about 2/3. (and the performance increase in indexing)
How to test
Install 4.0-dev and go to the backend of Smart Search. Enable the Smart Search Plugin and start the indexing process. Notice that it takes a long time (preferably measure the time with a stopwatch). Go to the frontend and open the Smart Search menu item. Search for something and notice the results.
Now apply these changes and go into the backend of Smart Search. Clear the index and rebuild the index, again checking with a stopwatch. You should see a 20% performance increase here. Now go to the frontend and do the same search. You should get the same result.
Do a search for a specific phrase, wrapping the phrase in quotes. See that this does not work anymore.
Please also see #20185 for further improvements for com_finder.
I want to thank cloudaccess.net for supporting these changes. Without their generous help, this wouldn't be possible.