HTML Search results aren't reader friendly #1618

Closed
shimizukawa opened this issue Jan 3, 2015 · 28 comments · Fixed by #4022
Labels
html search type:enhancement enhance or introduce a new feature

Comments

@shimizukawa
Member

The HTML built-in search is very useful, especially for offline help, but the results content isn't reader friendly.

For example, I get such content:

#!python

*************** Report *************** .. important:: For method, 

which isn't understandable by the average user.

This happens because searchtools.js uses the files in _sources, which are copies of the reST sources.
But I think the search only needs text files, not the real source files.

When I replace the content of _sources with the output of Sphinx's text builder and set

#!python

text_sectionchars = '       '

, I get a better result:

#!python

Report Important: For method,

This is a lot better, even though the * markers for bold text are still visible.

It would be great if this rendering to text were automated as part of the HTML build.
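
A minimal sketch of the manual workaround described above, assuming the text output was produced separately (for example with `sphinx-build -b text`) and that the HTML build stores its source copies as `<docname>.txt`. The function name, the directory arguments and the `source_suffix` default are illustrative assumptions; newer Sphinx versions may use a different suffix for the copied sources.

```python
import shutil
from pathlib import Path


def replace_sources_with_text(text_dir, sources_dir, source_suffix=".txt"):
    """Overwrite the copied sources with the text builder's output.

    text_dir:      output directory of the text build (assumed layout)
    sources_dir:   the _sources directory inside the HTML build
    source_suffix: suffix of the copied source files (an assumption)
    """
    text_dir, sources_dir = Path(text_dir), Path(sources_dir)
    copied = []
    for text_file in sorted(text_dir.rglob("*.txt")):
        # "sub/page.txt" -> document name "sub/page"
        docname = text_file.relative_to(text_dir).with_suffix("")
        target = sources_dir / (docname.as_posix() + source_suffix)
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(text_file, target)
        copied.append(target)
    return copied
```

It would be called after both builds have finished, e.g. `replace_sources_with_text("_build/text", "_build/html/_sources")`, where both paths are placeholders for the project's real output directories.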


@shimizukawa shimizukawa added type:enhancement enhance or introduce a new feature prio:low html search labels Jan 3, 2015
@shimizukawa
Member Author

From Andrea Cassioli on 2014-12-17 10:08:54+00:00

I totally agree, using the reST source text as output is misleading and not nice at all!

@KacerCZ

KacerCZ commented May 7, 2015

+1 for this feature
I was surprised that the search displays source files in the search results.
Users expect to see the rendered HTML content. For now, I am using the workaround of replacing the source files with text files.

@TimKam
Member

TimKam commented May 8, 2016

The search results would be much better if one:

  • removed all markup (e.g. headings, images, bold and italic print) when building the .txt files.
  • built the .txt files as part of the normal html build and placed them in the _sources directory of the build.

Is there any downside in doing this by default?

@lehmannro
Contributor

I'm pretty sure lots of machinery in Sphinx assumes there's only ever one build happening per runtime, so doing text as part of html might be both wasteful in terms of build time and complex to implement.

@TimKam
Member

TimKam commented May 8, 2016

Thanks for the info. Regarding the build of the txt file, a small customization of the makefile can do the trick.

Still, it would be nice if one could produce more "search-friendly" .txts. What do you think?

Currently, we run some custom script to remove the remaining markup.
I suppose this might be a relatively common problem Sphinx users have.

@TimKam
Member

TimKam commented Jul 31, 2016

In case this is a relevant issue for somebody else:

I wrote a - so far very basic - extension that fixes this issue and builds the search result snippets without markup.
GitHub: https://github.com/TimKam/sphinx-pretty-searchresults
PyPI: https://pypi.python.org/pypi/sphinxprettysearchresults

The extension should also provide a fix/workaround for issue #2369.

Of course, I welcome feedback & improvement suggestions.

@TimKam
Member

TimKam commented Aug 23, 2017

There's one fairly simple way to fix this without adding additional build steps or output files. Currently, the search displays result snippets by requesting the corresponding source files from the server/local file system and extracting the text from them. It's possible to adjust this functionality so that it requests the HTML files instead of the source files. Then, it's fairly simple to extract the text from the HTML at runtime in the client/browser.
Of course, this increases load sizes and computing time a bit, but I don't think the change in performance is significant.

What do you think @tk0miya ? I'll add a PR (which probably needs some refinement/discussion) later.

This would make the pretty search results extension obsolete, which is a good thing in my opinion, because the messed up search results are a hard bug in the eyes of the users and it shouldn't be necessary to install an extension to fix a bug :-)
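
To illustrate the idea, here is a minimal sketch in Python (the real change would live in the JavaScript of searchtools.js, which is not reproduced here). It assumes the page's main content sits in an element with `role="main"`; that attribute, the class name, the function names and the snippet width are assumptions made for the example.

```python
from html.parser import HTMLParser


class MainTextExtractor(HTMLParser):
    """Collect the text found inside elements with a given role attribute."""

    # Void elements never get an end tag, so they must not change the depth.
    VOID = {"area", "base", "br", "col", "embed", "hr", "img",
            "input", "link", "meta", "source", "track", "wbr"}
    SKIP = {"script", "style"}

    def __init__(self, role="main"):
        super().__init__()
        self.role = role
        self.depth = 0       # nesting depth inside the container, 0 = outside
        self.skip_depth = 0  # nesting depth inside <script>/<style>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.VOID:
            return
        if self.depth:
            self.depth += 1
            if tag in self.SKIP:
                self.skip_depth += 1
        elif dict(attrs).get("role") == self.role:
            self.depth = 1

    def handle_endtag(self, tag):
        if tag in self.VOID or not self.depth:
            return
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1
        self.depth -= 1

    def handle_data(self, data):
        if self.depth and not self.skip_depth:
            self.chunks.append(data)


def html_to_text(html, role="main"):
    """Return the whitespace-normalized text of the main content."""
    parser = MainTextExtractor(role)
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.chunks).split())


def make_snippet(html, term, width=60):
    """Cut a window of roughly `width` characters around the first hit."""
    text = html_to_text(html)
    pos = text.lower().find(term.lower())
    start = max(pos - width // 2, 0) if pos >= 0 else 0
    end = start + width
    prefix = "..." if start > 0 else ""
    suffix = "..." if end < len(text) else ""
    return prefix + text[start:end] + suffix
```

Because the text comes from the rendered page, navigation, scripts and reST markup never reach the snippet, at the cost of downloading the full HTML file for each result.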

TimKam added a commit to TimKam/sphinx that referenced this issue Aug 23, 2017
request results as HTML instead of source files
retrieve preview snippet text from HTML
@dbogdanov

Any updates on this issue?

@tstibbs

tstibbs commented Jun 8, 2018

It would be useful to know the current state of this issue. It's very confusing for users, as the main sphinx docs themselves don't seem to have this issue (e.g. see http://www.sphinx-doc.org/en/master/search.html?q=sphinx&check_keywords=yes&area=default)

@TimKam
Member

TimKam commented Jun 8, 2018

@tk0miya Could we have your opinions on this?

There are two alternatives to the approach I use in my PR (requesting the HTML):

  • Creating plain text files during build time, like in Sphinx: pretty search results: I don't think this is a good idea for standard Sphinx, because it increases the build time significantly.

  • Using regexps to remove the markup during run time: this is not elegant, either, but it at least feels light weight in comparison to loading the HTML files.

Can you live with any of these options?
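
For comparison, the regexp option could look roughly like the following. It is written in Python for illustration (the in-browser version would be JavaScript), the rule list is an assumption made for this sketch, and it deliberately covers only a few common reStructuredText constructs.

```python
import re

# Each rule is (pattern, replacement); the list is illustrative, not complete.
_RULES = [
    # Section adornment lines such as "*****" or "=====".
    (re.compile(r"^([=\-~*+#^`_:.])\1{2,}[ \t]*$", re.MULTILINE), ""),
    # Directive markers: ".. important::" becomes "important:".
    (re.compile(r"^[ \t]*\.\.[ \t]+([\w-]+)::", re.MULTILINE), r"\1:"),
    # Roles: ":meth:`run`" or ":py:meth:`run`" becomes "run".
    (re.compile(r":(?:[\w.+-]+:)+`([^`]*)`"), r"\1"),
    # Inline literals, strong emphasis, emphasis.
    (re.compile(r"``([^`]+)``"), r"\1"),
    (re.compile(r"\*\*([^*]+)\*\*"), r"\1"),
    (re.compile(r"\*([^*]+)\*"), r"\1"),
]


def strip_rst_markup(text):
    """Remove some reST markup and collapse whitespace for a snippet."""
    for pattern, replacement in _RULES:
        text = pattern.sub(replacement, text)
    return " ".join(text.split())
```

Constructs not covered by a rule (tables, substitutions, link targets and so on) pass through unchanged, which is the "somewhat imperfect" nature of this option.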

@tk0miya
Member

tk0miya commented Jun 20, 2018

I prefer the first. It certainly increases build time, but it can remove markup perfectly and can also support translation.
However, I know this approach requires a big refactoring of Sphinx core, so the latter one is useful enough, and we could improve our search much sooner than with the first approach. The big problem with the second approach is my skill: I'm not good at JavaScript, so I would not be able to maintain it. This approach would need new maintainers.

@TimKam
Member

TimKam commented Jun 20, 2018

Okay, then I propose the following:

  • I add a PR that removes the markup (somewhat imperfectly) with regexps.
  • If you want to merge it, but are concerned about its maintenance, I can take a limited/junior maintainer role with a focus on JavaScript and docs.

@tk0miya
Member

tk0miya commented Jun 21, 2018

I just remembered that @timhoffm sent us such a script in #4857. It might be a good workaround for this problem. Could you check it, please?

@shimizukawa What do you think about the workaround?

@tstibbs

tstibbs commented Jun 22, 2018

Worth noting that both sphinx-doc.org and readthedocs.org seem to have fixed this problem already, so there are already solutions to this that are being used in anger. They're presumably happy with their solutions, so would understanding them help inform which route is the most sensible?

@TimKam
Member

TimKam commented Jun 22, 2018

@tstibbs afaik sphinx-doc.org is hosted by ReadTheDocs, and ReadTheDocs provides a custom search back end (using Haystack and Elasticsearch). But for the average self-hosted Sphinx project, a search back end is presumably too much work, so fixing the front-end-only search in one of the ways I described is necessary.

@tk0miya I will take a look at #4857 asap and compare it to the tests I wrote for my sphinx-pretty-searchresults extension.

@pybride

pybride commented Jun 22, 2018

As the original reporter of the issue, I would ask you not to forget that a search back end isn't always possible. In some cases, we need the documentation and the search to work offline, i.e. with the HTML opened directly in a browser on the same machine, without any web server (and also without Internet access).
Without this offline requirement, I would have long since switched to a better search engine.

@TimKam
Member

TimKam commented Jun 22, 2018

I took a look at #4857 (regexp-parsing) and compared it to #4022 (using HTML snippets).
This is what the comparison looks like:

Regexp: [screenshot: search_regexp_parsing]

HTML: [screenshot: search_html_snippets]

IMHO, the regexp approach requires quite some additional work. The only disadvantage of the HTML approach is that it loads significantly more data, but I personally still think it's feasible (probably better than implementing a reStructuredText parser in JavaScript).

We could make this configurable (opt-out).

Any other opinions on this?

@timhoffm
Contributor

timhoffm commented Jun 22, 2018

Good to see that this topic gets attention.

I just did a minimum amount of work in #4857 (regexp-parsing) to get something readable.
The regexp approach gets you 80% of the way with no or little additional work on the minimal parser. A full reStructuredText parser wouldn't help that much more. Though there seem to be libraries for that, you'd still have to interpret the generated document tree.

The HTML search has clearly an advantage because it operates on the target document. Can you quantify how much "significantly more data" is?

Depending on how large the difference in data is and how important we consider it to be, making this configurable would be a good way forward. Regexp parsing is a drop-in improvement on the current plain rst search with no disadvantage. If you need even better results and are willing to accept the data overhead, use HTML search. Which one should be the default may depend on the numbers for the data overhead.

@TimKam
Member

TimKam commented Jun 22, 2018

I compared the data load for a set of Sphinx documentation pages:

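| Page         | Content length HTML (bytes) | Content length regexp (bytes) | Ratio (HTML/regexp) |
|--------------|-----------------------------|-------------------------------|---------------------|
| builders     | 62208                       | 16729                         | 3.72                |
| config       | 203764                      | 85442                         | 2.38                |
| ext/extlinks | 11756                       | 2342                          | 5.02                |
| theming      | 33809                       | 17427                         | 1.94                |
| quickstart   | 37713                       | 9577                          | 3.94                |
| Sum          | 349250                      | 131517                        | 2.66                |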

The differences in ratio can be explained by the following factors:

  • As the HTML pages contain fixed overhead (the general layout) for all pages, the ratio is worse for small pages.
  • Some pages contain directives that instruct Sphinx to include content from other sources (in particular docstrings, I suppose). This content is not included in the .rst files, which is a major flaw of the regexp approach. Note that I did not include pages that contain include directives, because this would have rendered the comparison useless.

Considering this information, I suggest we use the HTML approach without any configuration to keep things simple. It's just text content; IMHO there is no need to optimize for data load.
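
The ratios can be re-derived from the byte counts in the table above. The snippet below is only a small arithmetic check; the variable and function names are made up for the example.

```python
# Content lengths in bytes as (HTML, regexp), copied from the table above.
SIZES = {
    "builders": (62208, 16729),
    "config": (203764, 85442),
    "ext/extlinks": (11756, 2342),
    "theming": (33809, 17427),
    "quickstart": (37713, 9577),
}


def ratio_table(sizes):
    """Return per-page HTML/regexp ratios plus the totals."""
    ratios = {page: round(html / regexp, 2)
              for page, (html, regexp) in sizes.items()}
    total_html = sum(html for html, _ in sizes.values())
    total_regexp = sum(regexp for _, regexp in sizes.values())
    ratios["Sum"] = round(total_html / total_regexp, 2)
    return ratios, total_html, total_regexp
```

The smallest page in the sample (ext/extlinks) has the highest ratio, which matches the point about the fixed layout overhead of each HTML page.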

@timhoffm
Contributor

Thanks for digging into the numbers.

Considering all aspects, I'm fine with an HTML-only approach.

@TimKam
Member

TimKam commented Jun 23, 2018

@tk0miya Do you agree? Then we could move forward with my PR.

@shimizukawa
Member Author

+1 to the HTML-only approach. I think the approach is sufficiently useful as a workaround until Sphinx core has a build output function for search display.

@TimKam
Member

TimKam commented Jul 16, 2018

I resolved the conflicts in the original PR (which is close to a year old). Could someone do the review?

@tk0miya
Member

tk0miya commented Jul 17, 2018

+1 for the HTML approach. It can also support other source formats (Markdown and others).

TimKam added a commit to TimKam/sphinx that referenced this issue Jul 17, 2018
 Setting `html_copy_source` no longer affects search results
TimKam added a commit to TimKam/sphinx that referenced this issue Jul 17, 2018
@acsr

acsr commented Jul 19, 2018

+1 for the HTML approach. A solution based on the source files would be almost useless for us: in a translated build, the search context is currently shown as source text in the canonical language, which was very disappointing. The patch available at https://github.com/sphinx-doc/sphinx/pull/4022/files worked very well for us.

TimKam added a commit to TimKam/sphinx that referenced this issue Aug 28, 2018
TimKam added a commit that referenced this issue Aug 29, 2018
…friendly

#1618 make search results reader friendly
@tk0miya
Member

tk0miya commented Aug 30, 2018

@TimKam Congrats! Thank you for your work!

@ranjith19

ranjith19 commented Jan 21, 2019

I am using version 1.8.3 and I am still facing this issue. I tried the code on the master branch and it is fixed. When is version 2 planned for public release?

@TimKam
Member

TimKam commented Jan 21, 2019

2.0 is planned for mid to late March (without a definite date); see #5950. If you want the fix earlier, I recommend not switching to master but instead adjusting your template to include the updated version of searchtools.js, as changed here: https://github.com/sphinx-doc/sphinx/pull/4022/files#diff-71eb2d907f122b85744ef4c3390903cbR59

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 8, 2021