Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor HTMLWriter search index construction #966

Merged
merged 6 commits into from
Mar 3, 2019

Conversation

visr
Copy link
Contributor

@visr visr commented Feb 18, 2019

The mutable SearchIndexBuffer that held an IOBuffer that was filled
with JSON text is replaced by the immutable SearchRecord.

The HTMLContext now instead holds a Vector{SearchRecord}, which is
serialized to JSON at the end, done so with the help of
JSON.lower(::SearchRecord).

This should make it easier to possibly upload the search index to a
search server in the future. (Working on it, thought to try to get this in first)

Note that this adds JSON as a dependency. But this also allows code deletion in this package. It REQUIREs JSON 0.19, does the UUID in the TOML need to match this version? Now I believe it matches 0.20.

The aim was for this to be a nonfunctional change, but I see the generated search index has almost twice as many entries here. I'm not sure if that is because of inadvertent bugfixes or that I made a mistake, would be great to have some input there. Here are two files two compare:
https://gist.github.com/visr/fffcfb3d62c9a5d99f86104d3ac62ac5. This index is quite large and hard to compare, so here is a simpler one: https://gist.github.com/visr/8247c5c44344e995b38c4ee6d934bfbb/revisions?diff=split#diff-964a59d491880661889b0e908c47a1e0. One big difference is that now each paragraph gets it's own entry. Maybe it was always intended like this? I didn't change it on purpose. Actually I prefer one record per paragraph, since it can make search results better.

@fredrikekre
Copy link
Member

does the UUID in the TOML need to match this version

UUID is an identifier of a package, not a version.

@mortenpi
Copy link
Member

Apologies for not reviewing this earlier.

First, the refactoring of the search generation looks great. How do you intend to use this when uploading an index? Do you need a separate serializer for the SearchRecords, hence the need for a better intermediate representation of the index?

I am a bit sceptical about the JSON.jl dependency. In general, we have been very conservative with dependencies, as this this effectively adds JSON as a dependency to Julia. A JSON serializer for this limited case should be simple enough that we could maintain it here. On the other hand, in the 1.x era, a pinned JSON.jl should never really break and the package is pretty lightweight. So I need to think about this a bit more, but would love to hear opinions.

On a separate note, we're actually not using JSON, but JavaScript in the index file. Do you know if JSON.jl escapes the problematic Unicode characters in strings (this should be the only thing where JSON is not a subset of JS)?

.travis.yml Show resolved Hide resolved
@KristofferC
Copy link
Member

this effectively adds JSON as a dependency to Julia

FWIW, the benchmarking framework that runs daily on Julia also requires JSON.

visr added 3 commits February 25, 2019 12:00
…Buffer` that held an `IOBuffer` that was filledwith JSON text is replaced by the immutable `SearchRecord`.The `HTMLContext` now instead holds a `Vector{SearchRecord}`, which isserialized to JSON at the end, done so with the help of`JSON.lower(::SearchRecord)`.This should make it easier to possibly upload the search index to asearch server in the future.
To try to fix error about missing JSON
And git ignore them all. Also adds Documenter to the Project.toml in docs.
@visr
Copy link
Contributor Author

visr commented Feb 25, 2019

How do you intend to use this when uploading an index? Do you need a separate serializer for the SearchRecords, hence the need for a better intermediate representation of the index?

So far I believe I can use the same index for both uses. The easiest way would be to add a keyword argument search to HTML where you can set "lunr" or "algolia". For "lunr" we keep the current general behavior. For "algolia" we use HTTP.jl to upload the JSON index to Algolia servers, and write a different search.js in the build, which uses Algolia search.

I am a bit sceptical about the JSON.jl dependency.

One of the reasons I grabbed for JSON.jl was because the JSON currently generated was not valid, despite attempts at this with jsescape. If for instance from here you take the list (value of "docs"), and copy it in jsonlint.com, you can see there are still parser errors. So I don't think it's that simple to do it right without JSON as a dependency. Though ideas welcome, I understand you want to keep the dependencies as light as possible.

Though, seeing my comment about using HTTP.jl to upload the index, would it be a problem to depend on that as well? Do you see any alternative? Taking into account that hopefully in the future Base search will use the "algolia" option.

We're actually not using JSON, but JavaScript.

Good point, I just added a commit to just use JSON: aceb999

EDIT: I verified that search still works with a simple fork of Example.jl that uses this branch: https://visr.github.io/Example.jl/dev/.

@mortenpi
Copy link
Member

Does it work locally with file:// URLs? I am worried about same-origin policy.. but it might not actually be an issue here.

@visr
Copy link
Contributor Author

visr commented Feb 25, 2019

Locally I need to run a http-server for search to work, just like before. And on GitHub pages it seems to work fine: https://visr.github.io/Example.jl/dev/. Seems to me this would be same-origin, unless I misunderstand.

@mortenpi
Copy link
Member

One of the reasons I grabbed for JSON.jl was because the JSON currently generated was not valid, despite attempts at this with jsescape. If for instance from here you take the list (value of "docs"), and copy it in jsonlint.com, you can see there are still parser errors.

FWIW, it seems to be valid JS, which is what it was supposed to be. It looks like you're not allowed to escape ' characters in JSON. An MWE would be {"docs": "\'"}.

Locally I need to run a http-server for search to work, just like before. And on GitHub pages it seems to work fine: https://visr.github.io/Example.jl/dev/. Seems to me this would be same-origin, unless I misunderstand.

No, search should work without a local server too. And it's when you open with file:// URLs that you can run into problems with same-origin policy (if I remember correctly, when you try to access files from parent directories, but I am not sure).

@mortenpi
Copy link
Member

Yeah, unfortunately we need to stick to JS files unfortunately. With pretty URLs off:

jquery.min.js:4 Access to XMLHttpRequest at 'file:///home/mortenpi/Julia/JuliaDocs/Documenter/docs/build/search_index.json' from origin 'null' has been blocked by CORS policy: Cross origin requests are only supported for protocol schemes: http, data, chrome, chrome-extension, https.

However, that does not mean we have to give up on JSON.jl necessarily, as long as we can add in the additional Unicode escapes. And then for Algolia one can, without much extra code, output a proper JSON file.

Speaking of Algolia and HTTP.jl, one possible alternative would be to just use curl for uploading perhaps?

@visr
Copy link
Contributor Author

visr commented Feb 25, 2019

Hmm interesting, how did you check this? If I just open an old build and try to search anything, I just get this:
image

Good to know it's valid JS. If I want to use the same for Algolia it indeed needs to be JSON. It seems possible to keep most of this PR intact and drop the JSON dependency if you prefer. I might copy some serialisation code from JSON.jl.

I guess curl may work instead, not sure about Windows. Though I mostly expect it to be built from Linux CI, where you then set secure environment variables with the API keys.

@mortenpi
Copy link
Member

Hmm interesting, how did you check this?

You need to build with HTML(prettyurls=false) to get a "local" build (it doesn't create all the directories then, but .html files instead). With Documenter's docs you can just run julia docs/make.jl local.

It seems possible to keep most of this PR intact and drop the JSON dependency if you prefer. I might copy some serialisation code from JSON.jl.

My current feeling is that JSON.jl would actually be fine. Let's see if we get any strong objections, and if not, we can keep JSON.

HTTP.jl might be a bit too large and has too many deps of its own I feel. But that is not directly relevant for this PR.

By escaping two Unicode characters
@visr
Copy link
Contributor Author

visr commented Feb 26, 2019

Since for now there are no strong objections yet to JSON.jl, I went ahead a bit. I removed the commit that created a JSON file in the index, and instead added a commit that makes the extra escapes for JS that you mentioned.

As far as I'm concerned, this is now good to go. Though have you looked at the index differences mentioned in my first post?

@mortenpi
Copy link
Member

mortenpi commented Mar 3, 2019

Apologies for the delay, but LGTM! And yeah, let's keep JSON.jl.

The index differences seem to be due to the fact that previously we created a record for each header, with everything up to the next header appended to it, whereas with this we have one record for each top-level MD block. No objections to that from me. The search results are a bit different, but not obviously better or worse as far as I could tell. It seems to lower the priority of the "section" results slightly, which is probably a good thing.

@mortenpi mortenpi merged commit 28de68c into JuliaDocs:master Mar 3, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants