
Cannot index pages when using a custom port #461

Closed
ArthurFlag opened this issue May 28, 2019 · 10 comments

@ArthurFlag

Hi,

I've been using DocSearch without issue for weeks, but it suddenly seems that most of my content is not indexed.
I'm running a static website built with Sphinx, hosted locally at localhost:8080.

I'm indexing it at the moment with a local install of the DocSearch scraper (updated to the latest master), using the following config:

{
  "index_name": "abc-index",
  "sitemap_urls": ["http://127.0.0.1:8080/sitemap.xml"],
  "start_urls": [
    {
      "url": "http://127.0.0.1:8080/docs/"
    },
    {
      "url": "http://127.0.0.1:8080/abc-cloud/abc.html",
      "selector_key": "api-docs",
      "page_rank": 5
    },
    {
      "url": "http://127.0.0.1:8080/abc-cloud/associate.html",
      "selector_key": "api-docs",
      "page_rank": 4
    },
    {
      "url": "http://127.0.0.1:8080/abc-cloud/fulfillment.html",
      "selector_key": "api-docs",
      "page_rank": 0
    },
    {
      "url": "http://127.0.0.1:8080/abc-cloud/consumer.html",
      "selector_key": "api-docs",
      "page_rank": -1
    },
    {
      "url": "http://127.0.0.1:8080/abc-cloud/",
      "selector_key": "api-docs"
    }
  ],
  "stop_urls": [
    "http://127.0.0.1:8080/abc-cloud/../",
    "http://127.0.0.1:8080/abc-cloud/new.html",
    "http://127.0.0.1:8080/abc-cloud/hooks_new.html"
  ],
  "selectors": {
    "default": {
      "lvl0": "h1",
      "lvl1": "h2",
      "lvl2": "h3",
      "lvl3": "h4",
      "lvl4": "h5",
      "text": "p, li"
    },
    "api-docs": {
      "lvl0": "h1",
      "lvl1": "h2",
      "lvl2": "h3",
      "lvl3": "h4",
      "lvl4": "pre code.json",
      "text": "p, li"
    }
  }
}
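
For reference, the sitemap that `sitemap_urls` points at is generated by Sphinx; it lists one entry per page, roughly like this (illustrative paths, not the actual file):

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- illustrative entries only; the real file is generated by Sphinx -->
  <url><loc>http://127.0.0.1:8080/docs/getting-started.html</loc></url>
  <url><loc>http://127.0.0.1:8080/abc-cloud/abc.html</loc></url>
</urlset>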

I get the following output from docsearch:

> DocSearch: http://127.0.0.1:8080/docs/ 60 records)
> DocSearch: http://127.0.0.1:8080/abc-cloud/ 1 records)
> DocSearch: http://127.0.0.1:8080/abc-cloud/associate.html 509 records)
> DocSearch: http://127.0.0.1:8080/abc-cloud/consumer.html 745 records)
> DocSearch: http://127.0.0.1:8080/abc-cloud/fulfillment.html 662 records)
> DocSearch: http://127.0.0.1:8080/abc-cloud/abc.html 2288 records)

What I notice is that the crawler does not seem to descend into /docs/* or /abc-cloud/*; only the pages listed as start URLs are crawled.
How do I make the crawler recursive?

Thank you

@ArthurFlag
Author

Got a response from @s-pace on the docsearch repo:

👋 @arthurflageul

This repo is only related to the front-end part and the documentation of the product. Please move it here: https://github.com/algolia/docsearch-scraper

Some quick leads that might help you debug:

* Do not use URLs with a port, as it might adversely affect the crawl

* Are you sure that the sitemap is correctly parsed? Pages crawled from a sitemap are printed in cyan

* Are you sure that the missing pages are linked from a crawled one via an `<a>` tag?

* The stop_url `"http://127.0.0.1:8080/abc-cloud/../"` is interpreted as a regex. Be careful of side effects (see the sketch after this list)

* Comment out [these two lines](https://github.com/algolia/docsearch-scraper/blob/master/scraper/src/index.py#L55-L56) and run it again to see the full Scrapy logs. You will have more details about the crawl
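
To illustrate the regex point: an unescaped `.` matches any character, so the pattern `abc-cloud/../` matches `abc-cloud/` followed by any two characters, which can exclude far more pages than intended. A safer sketch of the same stop_urls escapes the dots (remember that JSON doubles the backslashes):

"stop_urls": [
  "http://127\\.0\\.0\\.1:8080/abc-cloud/new\\.html",
  "http://127\\.0\\.0\\.1:8080/abc-cloud/hooks_new\\.html"
]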

@ArthurFlag
Author

OK, so after double-checking, it seems to be because I'm running on port 8080.

For a bit of context: I'm indexing a private website, so I'm running it locally, and I have a Jenkins job that runs the docsearch Docker image against it.
Port 80 is typically reserved on most setups, so I have to use something else when Jenkins sets up a local server for itself.
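
(For reference, the Jenkins job runs the scraper image roughly like the documented invocation below; a sketch, where the .env file holds the APPLICATION_ID and API_KEY credentials:)

docker run -it --env-file=.env \
  -e "CONFIG=$(cat config.json | jq -r tostring)" \
  algolia/docsearch-scraper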

How much effort would it be to allow other ports?

@s-pace
Contributor

s-pace commented May 28, 2019

It's just how we parse the URL; using a port will break everything. I would recommend using a local DNS entry instead.
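
For example (a sketch; the hostname is made up): map a name to the loopback address in /etc/hosts and serve the site on the default port, so no port appears in any crawled URL:

# /etc/hosts — hypothetical entry
127.0.0.1   docs.local

Then point start_urls at http://docs.local/docs/ instead of http://127.0.0.1:8080/docs/.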

@ArthurFlag
Author

Thanks for the answer.
This is a very unfortunate design decision, and it is making things complicated for some of your paying customers 😞
If there is a way to push a feature to your backlog, I would like to request the ability to crawl any URL, regardless of the port.

Anyway, as a first improvement, I think this should be clearly documented.

@s-pace
Contributor

s-pace commented May 29, 2019

It is not good practice to use a port in production, which is why we do not document it.

Thanks for sharing. I would recommend using the default port 80 and not specifying it explicitly.

🙏

s-pace closed this as completed on Jun 9, 2019
@CodeSandwich

This just hit me as well. I too think that it should at least be documented, or should generate a meaningful error. Developing and testing a website on localhost port 8080 is a routine workflow, which makes this bug (?) a perfect beginners' trap.
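
For illustration, the meaningful error could be as simple as a guard at config-load time; a minimal sketch, not the scraper's actual code (the function name is made up):

from urllib.parse import urlparse

def reject_explicit_ports(start_url: str) -> None:
    # Hypothetical guard: fail fast instead of silently dropping records
    # when a start URL carries an explicit, non-default port.
    port = urlparse(start_url).port
    if port not in (None, 80, 443):
        raise ValueError(
            f"{start_url!r} uses port {port}: the scraper does not "
            f"handle explicit ports (see issue #461)"
        )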

@252819

252819 commented May 31, 2020

Ok

@huguesalary

Let me start by saying I really appreciate you guys making this tool open source; it's amazing.

Now, I just got hit by this as well, and it took me many hours to find this thread.

> It is not good practice to use a port in production, which is why we do not document it.
> Thanks for sharing. I would recommend using the default port 80 and not specifying it explicitly.

isn't a good justification. Tons of people run on non-default ports. This should, at a minimum, be documented.

@matkoch

matkoch commented Jun 26, 2022

Same here. Let me also say that I genuinely like Algolia, but this here is a bit ignorant:

> Port 80 is typically reserved on most setups

> I would recommend using the default port 80 and not specifying it explicitly.

I also spent a good few hours on this and still have no workaround.

@junandaip

Bumping this one, because we need the port feature to test the scraper in a dev environment.

ArthurFlag changed the title from "Most pages are not index" to "Cannot index pages when using a custom port" on Sep 19, 2022