HTML Search results aren't reader friendly #1618

shimizukawa · 2015-01-03T11:42:05Z

The HTML built-in search is very useful, especially for offline help, but the results content isn't reader friendly.

For example, I get such content:

#!python

*************** Report *************** .. important:: For method,

which isn't understandable by the average user.

This happens because searchtools.js uses the files in _sources, which are a copy of the reST sources.
But I think the search only needs text files, not the real source files.

By replacing the content of _sources with the output of the Sphinx text builder, and setting

#!python

text_sectionchars = '       '

I get a better result:

#!python

Report Important: For method,

That is a lot better, even though the * characters of the bold markup are still visible.

It would be great if this rendering to text were automated as part of the HTML build.
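
A minimal sketch of the manual workaround described above, assuming the text output has already been built into its own directory and that the files in _sources carry the same relative names as the text builder's .txt files (true for the setup described here, but not guaranteed for every Sphinx version). The function name and directory layout are hypothetical, chosen only for illustration:

```python
import shutil
from pathlib import Path


def overwrite_sources_with_text(text_dir, html_dir):
    """Copy every .txt file from a text build over the HTML build's _sources copy.

    Assumption: both trees use the same relative file names.
    """
    text_root = Path(text_dir)
    sources_root = Path(html_dir) / "_sources"
    copied = []
    for txt in sorted(text_root.rglob("*.txt")):
        target = sources_root / txt.relative_to(text_root)
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(txt, target)  # replaces the reST copy with plain text
        copied.append(target)
    return copied
```

Running this after both builds leaves searchtools.js reading plain text instead of reST, at the cost of a second build.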

Bitbucket: https://bitbucket.org/birkenfeld/sphinx/issue/1618
Originally reported by: C_PYB
Originally created at: 2014-11-06T18:07:35.870

shimizukawa · 2015-01-03T11:42:07Z

From Andrea Cassioli on 2014-12-17 10:08:54+00:00

I totally agree, using the reST text as output is misleading and not nice at all!

KacerCZ · 2015-05-07T09:22:34Z

+1 for this feature
I was surprised that the search displays source files in the search results.
Users expect to see the rendered HTML content. For now I am using the workaround of replacing the source files with text files.

TimKam · 2016-05-08T16:12:02Z

The search results would be much better if one:

- removed all markup (i.e. headings, images, bold and italic print) when building the .txt files.
- built the .txt files as part of the normal html build and placed them in the _sources directory of the build.

Is there any downside in doing this by default?

lehmannro · 2016-05-08T17:27:38Z

I'm pretty sure lots of machinery in Sphinx assumes there's only ever one build happening per runtime, so doing text as part of html might be both wasteful in terms of build time and complex to implement.

TimKam · 2016-05-08T17:46:58Z

Thanks for the info. Regarding the build of the txt files, a small customization of the Makefile can do the trick.

Still, it would be nice if one could produce more "search-friendly" .txts. What do you think?

Currently, we run some custom script to remove the remaining markup.
I suppose this might be a relatively common problem Sphinx users have.

TimKam · 2016-07-31T19:15:01Z

In case this is a relevant issue for somebody else:

I wrote a - so far very basic - extension that fixes this issue and builds the search result snippets without markup.
GitHub: https://github.com/TimKam/sphinx-pretty-searchresults
PyPi: https://pypi.python.org/pypi/sphinxprettysearchresults

The extension should also provide a fix/workaround for issue #2369.

Of course, I welcome feedback & improvement suggestions.

TimKam · 2017-08-23T18:35:32Z

There's one fairly simple way to fix this without adding additional build steps or output files. Currently, the search displays result snippets by requesting the corresponding source files from the server/local file system and extracting the text from them. It's possible to adjust this functionality so that it requests the HTML files instead of the source files. Then, it's fairly simple to extract the text from the HTML during client/browser runtime.
Of course, this increases load sizes and computing time a bit, but I don't think the change in performance is significant.
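
To illustrate the idea (not the actual PR, which does this in the browser inside searchtools.js), here is a minimal Python sketch of extracting text from a rendered page and cutting a preview snippet around the search term. The names `html_to_text` and `make_snippet`, the set of block-level tags, and the snippet width are all assumptions made for this example:

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text nodes, skipping <script> and <style> content."""

    _SKIPPED = {"script", "style"}
    # Tags treated as word boundaries so adjacent blocks do not run together.
    _BLOCK = {"p", "div", "section", "li", "br", "tr", "td", "th",
              "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self._SKIPPED:
            self._skip_depth += 1
        elif tag in self._BLOCK:
            self.chunks.append(" ")

    def handle_endtag(self, tag):
        if tag in self._SKIPPED and self._skip_depth:
            self._skip_depth -= 1
        elif tag in self._BLOCK:
            self.chunks.append(" ")

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def html_to_text(html):
    """Return the visible text of an HTML document with whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.chunks).split())


def make_snippet(text, term, width=40):
    """Cut a preview of roughly `width` characters on each side of `term`."""
    pos = text.lower().find(term.lower())
    if pos < 0:
        return text[: 2 * width].strip()
    start = max(pos - width, 0)
    end = min(pos + len(term) + width, len(text))
    prefix = "..." if start > 0 else ""
    suffix = "..." if end < len(text) else ""
    return prefix + text[start:end].strip() + suffix
```

Because the text comes from the rendered page, headings, admonitions and included content show up the way the reader sees them, with no markup left over.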

What do you think @tk0miya ? I'll add a PR (which probably needs some refinement/discussion) later.

This would make the pretty search results extension obsolete, which is a good thing in my opinion, because the messed up search results are a hard bug in the eyes of the users and it shouldn't be necessary to install an extension to fix a bug :-)

dbogdanov · 2017-09-29T14:55:52Z

Any updates on this issue?

tstibbs · 2018-06-08T10:09:55Z

It would be useful to know the current state of this issue. It's very confusing for users, as the main sphinx docs themselves don't seem to have this issue (e.g. see http://www.sphinx-doc.org/en/master/search.html?q=sphinx&check_keywords=yes&area=default)

TimKam · 2018-06-08T19:40:37Z

@tk0miya Could we have your opinions on this?

There are two alternatives to the approach I use in my PR (requesting the HTML):

- Creating plain text files at build time, as the sphinx-pretty-searchresults extension does: I don't think this is a good idea for standard Sphinx, because it increases the build time significantly.
- Using regexps to remove the markup at run time: this is not elegant either, but it at least feels lightweight compared to loading the HTML files.

Can you live with any of these options?
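
For reference, the regexp option could look roughly like the following. This is only a sketch of the idea in Python (the real thing would have to run in the browser), and the three patterns are hypothetical examples that cover heading adornments, directive markers and simple inline markup, nothing more:

```python
import re

# Hypothetical patterns for illustration; real reST has many more constructs.
_ADORNMENT = re.compile(r"^\s*([=\-~*#^+_.:])\1{2,}\s*$")  # heading over/underlines
_DIRECTIVE = re.compile(r"^\s*\.\.\s+[\w-]+::\s*")          # ".. important:: text"
_INLINE = re.compile(r"(\*\*|\*|``|`)(.+?)\1")              # **bold**, *italic*, ``code``


def strip_rest_markup(source):
    """Return the source as one line of text with common reST markup removed."""
    kept = []
    for line in source.splitlines():
        if _ADORNMENT.match(line):
            continue  # drop lines that consist only of adornment characters
        line = _DIRECTIVE.sub("", line)   # keep the directive's text, drop its name
        line = _INLINE.sub(r"\2", line)   # keep the text inside inline markup
        if line.strip():
            kept.append(line.strip())
    return " ".join(kept)
```

Note that roles, substitutions, tables and included files are not handled at all, which is why this approach can only ever be an approximation.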

tk0miya · 2018-06-20T16:26:22Z

I prefer the first. It certainly increases build time, but it can remove markup perfectly and can also support translation.
However, I know this approach requires a big refactoring of the Sphinx core, so the latter one is useful enough, and it would let us improve our search much sooner than the first approach. The big problem with the second approach is my skill: I'm not good at JavaScript, so I would not be able to maintain it. This approach would need new maintainers.

TimKam · 2018-06-20T18:33:51Z

Okay, then I propose the following:

- I add a PR that removes the markup (somewhat imperfectly) with regexps.
- If you want to merge it but are concerned about its maintenance, I can take a limited/junior maintainer role with a focus on JavaScript and docs.

tk0miya · 2018-06-21T16:12:55Z

I just remembered that @timhoffm sent us such a script in #4857. It might be a good workaround for this problem. Could you check it, please?

@shimizukawa What do you think about the workaround?

tstibbs · 2018-06-22T07:07:44Z

Worth noting that both sphinx-doc.org and readthedocs.org seem to have fixed this problem already, so there are already solutions to this that are being used in anger. They're presumably happy with their solutions, so would understanding them help inform which route is the most sensible?

TimKam · 2018-06-22T07:26:51Z

@tstibbs afaik sphinx-doc.org is hosted by ReadTheDocs. And ReadTheDocs provides a custom search back end (using Haystack and Elasticsearch). But for the average self-hosted Sphinx project, a search back end is presumably too much work, and fixing the front end-only search in one of the ways I described is necessary.

@tk0miya I will take a look at #4857 asap and compare it to the tests I wrote for my sphinx-pretty-searchresults extension.

pybride · 2018-06-22T07:32:11Z

As the original reporter of the issue, I would like to remind you that a search back end isn't always possible. In some cases, we need the documentation and the search to work offline, i.e. with the HTML opened directly in a browser on the same machine, without any web server (and without Internet access).
Without this offline requirement, I would have switched to a better search engine long ago.

TimKam · 2018-06-22T10:35:53Z

I took a look at #4857 (regexp-parsing) and compared it to #4022 (using HTML snippets).
This is what the comparison looks like:

Regexp:

HTML:

IMHO, the regexp approach requires quite some additional work. The only disadvantage of the HTML approach is that it loads significantly more data, but I personally still think it's feasible (probably better than implementing a reStructuredText parser in JavaScript).

We could make this configurable (opt-out).

Any other opinions on this?

timhoffm · 2018-06-22T11:44:04Z

Good to see that this topic gets attention.

I just did a minimum amount of work in #4857 (regexp-parsing) to get something readable.
The regexp approach gets you 80% of the way with no or little additional work on the minimal parser. A full reStructuredText parser wouldn't help that much more. Though there seem to be libraries for that, you'd still have to interpret the generated document tree.

The HTML search clearly has an advantage because it operates on the target document. Can you quantify how much "significantly more data" is?

Depending on how large the difference in data is and how important we consider it to be, making this configurable would be a good way forward. Regexp parsing is a drop-in improvement on the current plain rst search with no disadvantage. If you need even better results and are willing to take the data overhead, use the HTML search. Which one should be the default may depend on the numbers for the data overhead.

TimKam · 2018-06-22T21:24:38Z

I compared the data load for a set of Sphinx documentation pages:

| Page | Content length HTML (bytes) | Content length regexp (bytes) | Ratio (HTML/regexp) |
| --- | ---: | ---: | ---: |
| builders | 62208 | 16729 | 3.72 |
| config | 203764 | 85442 | 2.38 |
| ext/extlinks | 11756 | 2342 | 5.02 |
| theming | 33809 | 17427 | 1.94 |
| quickstart | 37713 | 9577 | 3.94 |
| Sum | 349250 | 131517 | 2.66 |
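
The ratios can be re-derived from the byte counts. A small sketch, using only the numbers reported in this comparison (the function name `ratios` is made up for the example):

```python
# (page, HTML bytes, regexp/source bytes) as reported in the comparison
SIZES = [
    ("builders", 62208, 16729),
    ("config", 203764, 85442),
    ("ext/extlinks", 11756, 2342),
    ("theming", 33809, 17427),
    ("quickstart", 37713, 9577),
]


def ratios(sizes):
    """Return per-page HTML/regexp ratios plus the totals over all pages."""
    rows = {page: round(html / src, 2) for page, html, src in sizes}
    total_html = sum(html for _, html, _ in sizes)
    total_src = sum(src for _, _, src in sizes)
    rows["Sum"] = round(total_html / total_src, 2)
    return rows, total_html, total_src
```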

The differences in ratio can be explained by the following factors:

- As the HTML pages contain fixed overhead (the general layout) for all pages, the ratio is worse for small pages.
- Some pages contain directives that instruct Sphinx to include content from other sources (in particular docstrings, I suppose). This content is not included in the .rst files, which is a major flaw of the regexp approach. Note that I did not include pages that contain include directives, because this would have rendered the comparison useless.

Considering this information, I suggest we use the HTML approach without any configuration to keep things simple. It's just text content; IMHO there is no need to optimize for data load.

timhoffm · 2018-06-23T11:25:53Z

Thanks for digging into the numbers.

Considering all aspects, I'm fine with an HTML-only approach.

TimKam · 2018-06-23T21:24:45Z

@tk0miya Do you agree? Then we could move forward with my PR.

shimizukawa · 2018-07-16T05:26:40Z

+1 to the HTML-only approach. I think the approach is sufficiently useful as a workaround until the Sphinx core has a build output function for search display.

TimKam · 2018-07-16T19:41:27Z

I resolved the conflicts in the original PR (which is close to a year old). Could someone do the review?

tk0miya · 2018-07-17T14:43:56Z

+1 for HTML approach. It can support other source formats (markdown and others).

acsr · 2018-07-19T07:38:05Z

+1 for the HTML approach, since a source-based solution would be almost useless for us: in translated documentation, the search context is currently shown as "source in the canonical language". This was very disappointing. The patch available at https://github.com/sphinx-doc/sphinx/pull/4022/files worked very well for us.

tk0miya · 2018-08-30T14:58:44Z

@TimKam Congrats! Thank you for your work!

ranjith19 · 2019-01-21T17:02:26Z

I am using version 1.8.3 and I am still facing this issue. I tried the code on the master branch and it is fixed. When is version 2 planned for public release?

TimKam · 2019-01-21T19:17:32Z

2.0 is planned for mid to late March (without a definite date), see: #5950. If you want the fix earlier, I recommend not switching to master, but instead adjusting your template to include the updated version of searchtools.js, as changed here: https://github.com/sphinx-doc/sphinx/pull/4022/files#diff-71eb2d907f122b85744ef4c3390903cbR59

shimizukawa added the type:enhancement, prio:low, and html search labels Jan 3, 2015

parsch mentioned this issue Nov 25, 2016

Search results show reST markup readthedocs/readthedocs.org#839

Closed

parsch referenced this issue Feb 16, 2017

Fix #3155: Fix JavaScript for html_sourcelink_suffix fails with IE …

1250738

…and Opera

pybride mentioned this issue May 3, 2017

html_sourcelink_suffix don't allow anymore to make search result user friendly #3696

Closed

TimKam mentioned this issue Aug 5, 2017

in i18n setups sphinxprettysearchresults returns only previews from canonical language #3862

Closed

TimKam added a commit to TimKam/sphinx that referenced this issue Aug 23, 2017

sphinx-doc#1618 make search results reader friendly

c25bf97

request results as HTML instead of source files retrieve preview snippet text from HTML

TimKam added a commit to TimKam/sphinx that referenced this issue Aug 23, 2017

sphinx-doc#1618 make search results reader friendly

14e39ad

request results as HTML instead of source files retrieve preview snippet text from HTML

TimKam mentioned this issue Aug 23, 2017

#1618 make search results reader friendly #4022

Merged

dbkinder mentioned this issue Apr 12, 2018

Have Sphinx search display txt not ReST as results zephyrproject-rtos/zephyr#7032

Closed

timhoffm mentioned this issue May 7, 2018

Improve search summary by filtering reStructuredText #4857

Closed

timhoffm mentioned this issue May 26, 2018

Ways to improve the documentation search results matplotlib/matplotlib#11315

Closed

tk0miya mentioned this issue Jul 17, 2018

Proposal: Invite TimKam as a new commiter #5190

Closed

TimKam added a commit to TimKam/sphinx that referenced this issue Jul 17, 2018

sphinx-doc#1618 remove obsolete warning from doc

6bbfc3b

Setting `html_copy_source` no longer affects search results

TimKam added a commit to TimKam/sphinx that referenced this issue Jul 17, 2018

sphinx-doc#1618 use file suffix config variable instead of '.html'

1c0cc20

TimKam added a commit to TimKam/sphinx that referenced this issue Aug 28, 2018

sphinx-doc#1618 document change

cbf2d8e

TimKam closed this as completed in #4022 Aug 29, 2018

TimKam added a commit that referenced this issue Aug 29, 2018

Merge pull request #4022 from TimKam/1618-make-search-results-reader-…

bc02abc

…friendly #1618 make search results reader friendly

TimKam mentioned this issue Dec 26, 2018

Broken search results for internal includes #2369

Closed

zeddee mentioned this issue Feb 1, 2019

Render markdown in search results mattermost/docs#625

Closed

github-actions bot locked as resolved and limited conversation to collaborators Aug 8, 2021
