Use Port in start_urls #42

Open
JasonWhall opened this issue May 26, 2023 · 2 comments


@JasonWhall

Description

We currently have a site set up in the scraper config that is hosted on a non-standard HTTP/HTTPS port (3000). When start_urls is set to a hostname with a port, e.g. http://my-host:3000/, the scraper fails with an error message saying that it does not accept domains with ports. It looks like the old Algolia scraper configs used to support ports, so I assume this is related to an update to the scrapy package used in this fork.

Steps to reproduce

  • Build and run a Docusaurus site locally, serving on http://localhost:3000
  • Update the DocSearch config to set start_urls: "start_urls": ["http://localhost:3000/"]
  • Run the docsearch scraper

Expected Behavior

  • Site is scraped and uploaded to Typesense server

Actual Behavior

Error returned from scraper:

PortWarning: allowed_domains accepts only domains without ports. Ignoring entry localhost:3000 in allowed_domains.
  warnings.warn(message, PortWarning)
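
For readers wondering where the `localhost:3000` entry in the warning comes from, here is a minimal sketch using only Python's standard library. It shows how a URL's network location differs from its bare hostname. That the scraper derives `allowed_domains` from the network location of each `start_urls` entry is an assumption based on the warning text, not something confirmed in this thread, and `split_host_and_port` is a hypothetical helper name.

```python
from urllib.parse import urlparse


def split_host_and_port(url):
    """Return (netloc, hostname, port) for a URL.

    Hypothetical helper for illustration only; it is not part of the
    scraper or of scrapy.
    """
    parsed = urlparse(url)
    # netloc keeps the ":3000" suffix, hostname drops it.
    return parsed.netloc, parsed.hostname, parsed.port


netloc, hostname, port = split_host_and_port("http://localhost:3000/")
print(netloc)    # network location, including the port
print(hostname)  # bare host name, without the port
print(port)      # port number as an integer
```

If the domain list were built from the bare hostname rather than the network location, the entry would contain no port. Whether that is a suitable fix for the scraper is not established here.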

Metadata

Typesense Version:

Docker images:

  • typesense/typesense:0.24.1
  • typesense/docsearch-scraper:0.6.0

OS: Linux

@jasonbosco
Member

typesense-docsearch-scraper has all the commits from algolia-docsearch-scraper up to Dec 22, 2020. I don't see any updates in the algolia scraper since then where this port limitation was addressed...

Also I still see that error message about ports not allowed in allowed_domains in the master branch of scrapy here. So this limitation still exists as of today.

So I'm surprised to see a config in the docsearch scraper configs repo with a port number!
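
The limitation described in this comment can be sketched roughly as follows. This is not scrapy's code: `PortWarning` here is a local stand-in class, `filter_allowed_domains` is a hypothetical function, and the trailing-port regular expression is an assumption about how such a check might be written. Only the warning text is taken from the error reported above.

```python
import re
import warnings


class PortWarning(Warning):
    """Local stand-in for the warning category named in the report."""


# Assumption: an entry "has a port" if it ends in a colon followed by digits.
PORT_PATTERN = re.compile(r":\d+$")


def filter_allowed_domains(domains):
    """Drop entries that carry a port, warning once for each dropped entry."""
    kept = []
    for domain in domains:
        if PORT_PATTERN.search(domain):
            warnings.warn(
                "allowed_domains accepts only domains without ports. "
                f"Ignoring entry {domain} in allowed_domains.",
                PortWarning,
            )
            continue
        kept.append(domain)
    return kept
```

Under this sketch, an entry such as `localhost:3000` is dropped from the list while `example.com` is kept, which matches the symptom in the report: the domain with a port is ignored rather than accepted.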

@noghartt

Any update on this? I'm facing the same issue, and I don't understand whether I'm able to test Typesense locally.
