Most pages are not index #717

Closed
ArthurFlag opened this issue May 28, 2019 · 2 comments

Comments

@ArthurFlag

Hi,

I've been using DocSearch without issue for weeks, but suddenly most of my content no longer seems to be indexed.
I'm running a static website built with Sphinx, and I host it locally on localhost:8080.

At the moment I'm indexing it with a local install of docsearch (updated to the latest master), using the following config:

{
  "index_name": "abc-index",
  "sitemap_urls": ["http://127.0.0.1:8080/sitemap.xml"],
  "start_urls": [
    {
      "url": "http://127.0.0.1:8080/docs/"
    },
    {
      "url": "http://127.0.0.1:8080/abc-cloud/abc.html",
      "selector_key": "api-docs",
      "page_rank": 5
    },
    {
      "url": "http://127.0.0.1:8080/abc-cloud/associate.html",
      "selector_key": "api-docs",
      "page_rank": 4
    },
    {
      "url": "http://127.0.0.1:8080/abc-cloud/fulfillment.html",
      "selector_key": "api-docs",
      "page_rank": 0
    },
    {
      "url": "http://127.0.0.1:8080/abc-cloud/consumer.html",
      "selector_key": "api-docs",
      "page_rank": -1
    },
    {
      "url": "http://127.0.0.1:8080/abc-cloud/",
      "selector_key": "api-docs"
    }
  ],
  "stop_urls": [
    "http://127.0.0.1:8080/abc-cloud/../",
    "http://127.0.0.1:8080/abc-cloud/new.html",
    "http://127.0.0.1:8080/abc-cloud/hooks_new.html"
  ],
  "selectors": {
    "default": {
      "lvl0": "h1",
      "lvl1": "h2",
      "lvl2": "h3",
      "lvl3": "h4",
      "lvl4": "h5",
      "text": "p, li"
    },
    "api-docs": {
      "lvl0": "h1",
      "lvl1": "h2",
      "lvl2": "h3",
      "lvl3": "h4",
      "lvl4": "pre code.json",
      "text": "p, li"
    }
  }
}

I get the following output from docsearch:

> DocSearch: http://127.0.0.1:8080/docs/ 60 records)
> DocSearch: http://127.0.0.1:8080/abc-cloud/ 1 records)
> DocSearch: http://127.0.0.1:8080/abc-cloud/associate.html 509 records)
> DocSearch: http://127.0.0.1:8080/abc-cloud/consumer.html 745 records)
> DocSearch: http://127.0.0.1:8080/abc-cloud/fulfillment.html 662 records)
> DocSearch: http://127.0.0.1:8080/abc-cloud/abc.html 2288 records)

What I notice is that the crawler does not seem to go into /docs/* and /abc-cloud/*; only the pages listed as start URLs are crawled.
How do I make the crawler recursive?

Thank you
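
A crawler can only recurse into pages it discovers, typically through links on pages it has already fetched or through a sitemap. As a rough diagnostic, the sketch below lists the same-site links found in a page's HTML, so one can check whether a start page actually links to the missing pages. The helper names (`LinkCollector`, `same_site_links`) and the sample HTML are hypothetical, and this is not how the scraper itself extracts links; it only uses the Python standard library.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects absolute URLs from the href attribute of <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, href))


def same_site_links(html, base_url):
    """Return the links in `html` that point to the same host as `base_url`."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    host = urlparse(base_url).netloc
    return [link for link in collector.links if urlparse(link).netloc == host]


if __name__ == "__main__":
    # Hypothetical page content; in practice, fetch the start URL and pass its body.
    sample = '<a href="intro.html">Intro</a> <a href="https://example.com/x">External</a>'
    for link in same_site_links(sample, "http://127.0.0.1:8080/docs/"):
        print(link)
```

If a missing page never appears in the output for any crawled page, and it is not in the sitemap either, a link-following crawler has no way to reach it.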

@s-pace

s-pace commented May 28, 2019

👋 @arthurflageul

This repo only covers the front-end part and the documentation of the product. Please move this issue here: https://github.com/algolia/docsearch-scraper

Some quick leads that might help you debug:

  • Do not use URLs with a port, as it might wrongly impact the crawl
  • Are you sure that the sitemap is correctly parsed? Pages crawled from a sitemap are written in cyan
  • Are you sure that the missing pages are linked from a crawled page with an <a> tag?
  • The stop_url "http://127.0.0.1:8080/abc-cloud/../" is interpreted as a regex. Be careful of side effects
  • Comment out these two lines and run it again to see the full Scrapy logs. You will have more details about the crawl
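
To illustrate the regex point, here is a minimal sketch of what that stop_url can match when read as a regular expression. It assumes the pattern is applied as an unanchored search (the scraper's exact matching logic is not shown in this thread), and the `/abc-cloud/v1/index.html` URL is hypothetical.

```python
import re

# The stop_urls entry from the config, read as a regular expression.
# Each "." matches any single character, so "../" means "any two characters, then a slash".
STOP_PATTERN = "http://127.0.0.1:8080/abc-cloud/../"

# Escaped version that matches only the literal string.
LITERAL_PATTERN = re.escape(STOP_PATTERN)


def is_stopped(url, pattern=STOP_PATTERN):
    """Assumption: a URL is excluded when the pattern is found anywhere in it."""
    return re.search(pattern, url) is not None


if __name__ == "__main__":
    candidates = [
        "http://127.0.0.1:8080/abc-cloud/v1/index.html",   # hypothetical page
        "http://127.0.0.1:8080/abc-cloud/associate.html",  # start URL from the config
    ]
    for url in candidates:
        print(url, is_stopped(url), is_stopped(url, LITERAL_PATTERN))
```

Under that assumption, any page below a two-character subdirectory of /abc-cloud/ would be excluded by the unescaped pattern, while the escaped pattern matches only the literal "../" path.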

@ArthurFlag
Author

Sorry for posting in the wrong location. I'm closing this and moving it there.
