
Cannot index pages when using a custom port #461

Closed
ArthurFlag opened this issue May 28, 2019 · 10 comments

@ArthurFlag

Hi,

I've been using DocSearch without issue for weeks, but it suddenly seems that most of my content is not indexed.
I'm running a static website built with Sphinx, hosted locally at localhost:8080.

I'm indexing it at the moment with a local install of the DocSearch scraper (updated to the latest master), using the following config:

{
  "index_name": "abc-index",
  "sitemap_urls": ["http://127.0.0.1:8080/sitemap.xml"],
  "start_urls": [
    {
      "url": "http://127.0.0.1:8080/docs/"
    },
    {
      "url": "http://127.0.0.1:8080/abc-cloud/abc.html",
      "selector_key": "api-docs",
      "page_rank": 5
    },
    {
      "url": "http://127.0.0.1:8080/abc-cloud/associate.html",
      "selector_key": "api-docs",
      "page_rank": 4
    },
    {
      "url": "http://127.0.0.1:8080/abc-cloud/fulfillment.html",
      "selector_key": "api-docs",
      "page_rank": 0
    },
    {
      "url": "http://127.0.0.1:8080/abc-cloud/consumer.html",
      "selector_key": "api-docs",
      "page_rank": -1
    },
    {
      "url": "http://127.0.0.1:8080/abc-cloud/",
      "selector_key": "api-docs"
    }
  ],
  "stop_urls": [
    "http://127.0.0.1:8080/abc-cloud/../",
    "http://127.0.0.1:8080/abc-cloud/new.html",
    "http://127.0.0.1:8080/abc-cloud/hooks_new.html"
  ],
  "selectors": {
    "default": {
      "lvl0": "h1",
      "lvl1": "h2",
      "lvl2": "h3",
      "lvl3": "h4",
      "lvl4": "h5",
      "text": "p, li"
    },
    "api-docs": {
      "lvl0": "h1",
      "lvl1": "h2",
      "lvl2": "h3",
      "lvl3": "h4",
      "lvl4": "pre code.json",
      "text": "p, li"
    }
  }
}
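
For reference, the sitemap that `sitemap_urls` points at is generated by Sphinx; it lists one entry per page, roughly like this (illustrative paths, not the actual file):

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- illustrative entries only; the real file is generated by Sphinx -->
  <url><loc>http://127.0.0.1:8080/docs/getting-started.html</loc></url>
  <url><loc>http://127.0.0.1:8080/abc-cloud/abc.html</loc></url>
</urlset>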

I get the following output from docsearch:

> DocSearch: http://127.0.0.1:8080/docs/ 60 records)
> DocSearch: http://127.0.0.1:8080/abc-cloud/ 1 records)
> DocSearch: http://127.0.0.1:8080/abc-cloud/associate.html 509 records)
> DocSearch: http://127.0.0.1:8080/abc-cloud/consumer.html 745 records)
> DocSearch: http://127.0.0.1:8080/abc-cloud/fulfillment.html 662 records)
> DocSearch: http://127.0.0.1:8080/abc-cloud/abc.html 2288 records)

What I notice is that the crawler does not seem to descend into /docs/* or /abc-cloud/*; only the pages listed as start URLs are crawled.
How do I make the crawler recursive?

Thank you

@ArthurFlag
Author

Got a response from @s-pace on the docsearch repo:

👋 @arthurflageul

This repo is only related to the front-end part and the documentation of the product. Please move it here: https://github.com/algolia/docsearch-scraper

Some quick leads that might help you debug:

* Do not use URLs with a port, as it might adversely affect the crawl

* Are you sure that the sitemap is correctly parsed? Pages crawled from a sitemap are printed in cyan

* Are you sure that the missing pages are linked from a crawled one via an `<a>` tag?

* The stop_url `"http://127.0.0.1:8080/abc-cloud/../"` is interpreted as a regex. Be careful of side effects (see the sketch after this list)

* Comment out [these two lines](https://github.com/algolia/docsearch-scraper/blob/master/scraper/src/index.py#L55-L56) and run it again to see the full Scrapy logs. You will have more details about the crawl
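
To illustrate the regex point: an unescaped `.` matches any character, so the pattern `abc-cloud/../` matches `abc-cloud/` followed by any two characters, which can exclude far more pages than intended. A safer sketch of the same stop_urls escapes the dots (remember that JSON doubles the backslashes):

"stop_urls": [
  "http://127\\.0\\.0\\.1:8080/abc-cloud/new\\.html",
  "http://127\\.0\\.0\\.1:8080/abc-cloud/hooks_new\\.html"
]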

@ArthurFlag
Author

OK, so after double-checking, it seems to be because I'm running on port 8080.

For a bit of context: I'm indexing a private website, so I'm running it locally, and I have a Jenkins job that runs the docsearch Docker image against it.
Port 80 is typically reserved on most setups, so I have to use something else when Jenkins sets up a local server for itself.
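
(For reference, the Jenkins job runs the scraper image roughly like the documented invocation below; a sketch, where the .env file holds the APPLICATION_ID and API_KEY credentials:)

docker run -it --env-file=.env \
  -e "CONFIG=$(cat config.json | jq -r tostring)" \
  algolia/docsearch-scraper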

How much effort would it be to allow other ports?

@s-pace
Contributor

s-pace commented May 28, 2019

It's just how we parse the URL; using a port will break everything. I would recommend using a local DNS entry instead.
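
For example (a sketch; the hostname is made up): map a name to the loopback address in /etc/hosts and serve the site on the default port, so no port appears in any crawled URL:

# /etc/hosts — hypothetical entry
127.0.0.1   docs.local

Then point start_urls at http://docs.local/docs/ instead of http://127.0.0.1:8080/docs/.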

@ArthurFlag
Author

Thanks for the answer.
This is a very unfortunate design decision, and it is making things complicated for some of your paying customers 😞
If there is a way to push a feature to your backlog, I would like to request the ability to crawl any URL, regardless of the port.

Anyway, as a first improvement, I think this should be clearly documented.

@s-pace
Contributor

s-pace commented May 29, 2019

It is not good practice to use a port in production, which is why we do not document it.

Thanks for sharing. I would recommend using the default port 80 and not specifying it explicitly.

🙏

s-pace closed this as completed on Jun 9, 2019
@CodeSandwich

This just hit me as well. I too think that it should at least be documented, or should generate a meaningful error. Developing and testing a website on localhost port 8080 is a routine workflow, which makes this bug (?) a perfect beginners' trap.
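
For illustration, the meaningful error could be as simple as a guard at config-load time; a minimal sketch, not the scraper's actual code (the function name is made up):

from urllib.parse import urlparse

def reject_explicit_ports(start_url: str) -> None:
    # Hypothetical guard: fail fast instead of silently dropping records
    # when a start URL carries an explicit, non-default port.
    port = urlparse(start_url).port
    if port not in (None, 80, 443):
        raise ValueError(
            f"{start_url!r} uses port {port}: the scraper does not "
            f"handle explicit ports (see issue #461)"
        )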

@252819

252819 commented May 31, 2020

Ok

@huguesalary

Let me start by saying I really appreciate you guys making this tool open source; it's amazing.

Now, I just got hit by this as well, and it took me many hours to find this thread.

> It is not good practice to use a port in production, which is why we do not document it.
> Thanks for sharing. I would recommend using the default port 80 and not specifying it explicitly.

isn't a good justification. Tons of people run on non-default ports. This should, at a minimum, be documented.

@matkoch

matkoch commented Jun 26, 2022

Same here. Let me also say that I genuinely like Algolia, but this here is a bit ignorant:

> Port 80 is typically reserved on most setups

> I would recommend using the default port 80 and not specifying it explicitly.

I also spent a good few hours on this and still have no workaround.

@junandaip

Bumping this one, because we need the port feature to test the scraper in a dev environment.

ArthurFlag changed the title from "Most pages are not index" to "Cannot index pages when using a custom port" on Sep 19, 2022