[error] Pipeline crash by call: Crawly.Middlewares.UniqueRequest.run #274

Open · PotentSoftware opened this issue Oct 3, 2023 · 0 comments
Labels
bug Something isn't working

Comments

I am writing an Elixir web scraper using the "Crawly" module. I have created two modules. The first, named "ManufacturersToScrape",
collects a list of URLs that it passes to the second, named "ModelsToScrape". The source code for both modules is shown
below under headings corresponding to the module source files; I've done this to show that the module names are consistent with
the source file names. Both modules compile without errors, and I know that the "ManufacturersToScrape" module collects well-formed
URLs as expected. However, at runtime I see multiple pipeline errors similar to this:

"[error] Pipeline crash by call: Crawly.Middlewares.UniqueRequest.run(%Crawly.Request{url: [url: "https://www.anchorvans.co.uk/specifications/vauxhall/"

The errors show that the URLs themselves are being created correctly. The application appears to fail when the variable "next_requests"
is passed on for processing by the "ModelsToScrape" module. That module doesn't do much yet; I just wanted to prove that the URLs are
being passed in correctly, and clearly they are not.

I have provided an execution log below under the heading "execution log".

Firstly, please can you confirm whether the call:

 "next_requests = Enum.map(manufacturer_urls, &Crawly.Request.new(url: &1 |> to_string(), spider: ModelsToScrape))"

is the correct approach? Using iex and passing a list of URLs to the Enum.map call above returns the following data structure, which looks good to me:

next = Enum.map(urls, &Crawly.Request.new(url: &1 |> to_string(), spider: ModelsToScrape))
[
%Crawly.Request{
    url: [
    url: "https://www.anchorvans.co.uk/specifications/citroen/",
    spider: ModelsToScrape
    ],
    headers: [],
    prev_response: nil,
    options: [],
    middlewares: [
    Crawly.Middlewares.DomainFilter,
    Crawly.Middlewares.UniqueRequest,
    {Crawly.Middlewares.UserAgent, [user_agents: ["Crawly Bot", "Google"]]}
    ],
    retries: 0
},
%Crawly.Request{
    url: [
    url: "https://www.anchorvans.co.uk/specifications/fiat/",
    spider: ModelsToScrape
    ],
    headers: [],
    prev_response: nil,
    options: [],
    middlewares: [
    Crawly.Middlewares.DomainFilter,
    Crawly.Middlewares.UniqueRequest,
    {Crawly.Middlewares.UserAgent, [user_agents: ["Crawly Bot", "Google"]]}
    ],
    retries: 0
},
%Crawly.Request{
    url: [
    url: "https://www.anchorvans.co.uk/specifications/ford/",
    spider: ModelsToScrape
    ],
    headers: [],
    prev_response: nil,
    options: [],
    middlewares: [
    Crawly.Middlewares.DomainFilter,
    Crawly.Middlewares.UniqueRequest,
    {Crawly.Middlewares.UserAgent, [user_agents: ["Crawly Bot", "Google"]]}
    ],
    retries: 0
},
%Crawly.Request{
    url: [
    url: "https://www.anchorvans.co.uk/specifications/isuzu/",
    spider: ModelsToScrape
    ],
    headers: [],
    prev_response: nil,
    options: [],
    middlewares: [
    Crawly.Middlewares.DomainFilter,
    Crawly.Middlewares.UniqueRequest,
    {Crawly.Middlewares.UserAgent, [user_agents: ["Crawly Bot", "Google"]]}
    ],
    retries: 0
},
%Crawly.Request{
    url: [
    url: "https://www.anchorvans.co.uk/specifications/iveco/",
    spider: ModelsToScrape
    ],
    headers: [],
    prev_response: nil,
    options: [],
    middlewares: [
    Crawly.Middlewares.DomainFilter,
    Crawly.Middlewares.UniqueRequest,
    {Crawly.Middlewares.UserAgent, [user_agents: ["Crawly Bot", "Google"]]}
    ],
    retries: 0
},
%Crawly.Request{
    url: [
    url: "https://www.anchorvans.co.uk/specifications/kia/",
    spider: ModelsToScrape
    ],
    headers: [],
    prev_response: nil,
    options: [],
    middlewares: [
    Crawly.Middlewares.DomainFilter,
    Crawly.Middlewares.UniqueRequest,
    {Crawly.Middlewares.UserAgent, [user_agents: ["Crawly Bot", "Google"]]}
    ],
    retries: 0
},
%Crawly.Request{
    url: [
    url: "https://www.anchorvans.co.uk/specifications/landrover/",
    spider: ModelsToScrape
    ],
    headers: [],
    prev_response: nil,
    options: [],
    middlewares: [
    Crawly.Middlewares.DomainFilter,
    Crawly.Middlewares.UniqueRequest,
    {Crawly.Middlewares.UserAgent, [user_agents: ["Crawly Bot", "Google"]]}
    ],
    retries: 0
},
%Crawly.Request{
    url: [
    url: "https://www.anchorvans.co.uk/specifications/ldv/",
    spider: ModelsToScrape
    ],
    headers: [],
    prev_response: nil,
    options: [],
    middlewares: [
    Crawly.Middlewares.DomainFilter,
    Crawly.Middlewares.UniqueRequest,
    {Crawly.Middlewares.UserAgent, [user_agents: ["Crawly Bot", "Google"]]}
    ],
    retries: 0
},
%Crawly.Request{
    url: [
    url: "https://www.anchorvans.co.uk/specifications/mazda/",
    spider: ModelsToScrape
    ],
    headers: [],
    prev_response: nil,
    options: [],
    middlewares: [
    Crawly.Middlewares.DomainFilter,
    Crawly.Middlewares.UniqueRequest,
    {Crawly.Middlewares.UserAgent, [user_agents: ["Crawly Bot", "Google"]]}
    ],
    retries: 0
},
%Crawly.Request{
    url: [
    url: "https://www.anchorvans.co.uk/specifications/mercedes/",
    spider: ModelsToScrape
    ],
    headers: [],
    prev_response: nil,
    options: [],
    middlewares: [
    Crawly.Middlewares.DomainFilter,
    Crawly.Middlewares.UniqueRequest,
    {Crawly.Middlewares.UserAgent, [user_agents: ["Crawly Bot", "Google"]]}
    ],
    retries: 0
},
%Crawly.Request{
    url: [
    url: "https://www.anchorvans.co.uk/specifications/mitsubishi/",
    spider: ModelsToScrape
    ],
    headers: [],
    prev_response: nil,
    options: [],
    middlewares: [
    Crawly.Middlewares.DomainFilter,
    Crawly.Middlewares.UniqueRequest,
    {Crawly.Middlewares.UserAgent, [user_agents: ["Crawly Bot", "Google"]]}
    ],
    retries: 0
},
%Crawly.Request{
    url: [
    url: "https://www.anchorvans.co.uk/specifications/nissan/",
    spider: ModelsToScrape
    ],
    headers: [],
    prev_response: nil,
    options: [],
    middlewares: [
    Crawly.Middlewares.DomainFilter,
    Crawly.Middlewares.UniqueRequest,
    {Crawly.Middlewares.UserAgent, [user_agents: ["Crawly Bot", "Google"]]}
    ],
    retries: 0
},
%Crawly.Request{
    url: [
    url: "https://www.anchorvans.co.uk/specifications/peugeot/",
    spider: ModelsToScrape
    ],
    headers: [],
    prev_response: nil,
    options: [],
    middlewares: [
    Crawly.Middlewares.DomainFilter,
    Crawly.Middlewares.UniqueRequest,
    {Crawly.Middlewares.UserAgent, [user_agents: ["Crawly Bot", "Google"]]}
    ],
    retries: 0
},
%Crawly.Request{
    url: [
    url: "https://www.anchorvans.co.uk/specifications/piaggio/",
    spider: ModelsToScrape
    ],
    headers: [],
    prev_response: nil,
    options: [],
    middlewares: [
    Crawly.Middlewares.DomainFilter,
    Crawly.Middlewares.UniqueRequest,
    {Crawly.Middlewares.UserAgent, [user_agents: ["Crawly Bot", "Google"]]}
    ],
    retries: 0
},
%Crawly.Request{
    url: [
    url: "https://www.anchorvans.co.uk/specifications/renault/",
    spider: ModelsToScrape
    ],
    headers: [],
    prev_response: nil,
    options: [],
    middlewares: [
    Crawly.Middlewares.DomainFilter,
    Crawly.Middlewares.UniqueRequest,
    {Crawly.Middlewares.UserAgent, [user_agents: ["Crawly Bot", "Google"]]}
    ],
    retries: 0
},
%Crawly.Request{
    url: [
    url: "https://www.anchorvans.co.uk/specifications/toyota/",
    spider: ModelsToScrape
    ],
    headers: [],
    prev_response: nil,
    options: [],
    middlewares: [
    Crawly.Middlewares.DomainFilter,
    Crawly.Middlewares.UniqueRequest,
    {Crawly.Middlewares.UserAgent, [user_agents: ["Crawly Bot", "Google"]]}
    ],
    retries: 0
},
%Crawly.Request{
    url: [
    url: "https://www.anchorvans.co.uk/specifications/vauxhall/",
    spider: ModelsToScrape
    ],
    headers: [],
    prev_response: nil,
    options: [],
    middlewares: [
    Crawly.Middlewares.DomainFilter,
    Crawly.Middlewares.UniqueRequest,
    {Crawly.Middlewares.UserAgent, [user_agents: ["Crawly Bot", "Google"]]}
    ],
    retries: 0
},
%Crawly.Request{
    url: [
    url: "https://www.anchorvans.co.uk/specifications/volkswagen/",
    spider: ModelsToScrape
    ],
    headers: [],
    prev_response: nil,
    options: [],
    middlewares: [
    Crawly.Middlewares.DomainFilter,
    Crawly.Middlewares.UniqueRequest,
    {Crawly.Middlewares.UserAgent, [user_agents: ["Crawly Bot", "Google"]]}
    ],
    retries: 0
}
] 

It seems to me that there could be a problem with either HTTPoison or its interface with Crawly.
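
Based on the ArgumentError in the execution log below, the whole keyword list seems to end up in the request's :url field, which HTTPoison
then tries to convert with List.to_string/1. If Crawly.Request.new expects the URL as a plain string in its first positional argument
(an assumption on my part, suggested by the shape of the struct and the error), then a minimal sketch of the alternative would be:

# Sketch only - assumes Crawly.Request.new/1 takes the URL string positionally,
# and that :spider is not a recognised Crawly.Request option (both assumptions).
next_requests =
  Enum.map(manufacturer_urls, fn url ->
    Crawly.Request.new(to_string(url))
  end)

# Crawly.Utils also appears to provide helpers for building requests from plain
# URL strings, e.g. requests_from_urls/1 (again, an assumption on my part):
# next_requests = Crawly.Utils.requests_from_urls(manufacturer_urls)

Confirmation either way on the intended way to build these requests would be appreciated.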

manufacturers_to_scrape.ex
--------------------------
defmodule ManufacturersToScrape do
  use Crawly.Spider

  @impl Crawly.Spider
  def base_url(), do: "https://www.anchorvans.co.uk/"

  @impl Crawly.Spider
  def init() do
    [start_urls: ["https://www.anchorvans.co.uk/specifications/"]]
  end

  @impl Crawly.Spider
  @doc """
  Extract items and requests to follow from the given response
  """
  def parse_item(response) do
    # Extract item field from the response here. Usually it's done this way:
    # {:ok, document} = Floki.parse_document(response.body)
    # item = %{
    #   title: document |> Floki.find("title") |> Floki.text(),
    #   url: response.request_url
    # }
    {:ok, document} = Floki.parse_document(response.body)
    current_url = response.request_url
    IO.inspect(current_url, label: "Current URL")

    manufacturer_names = collect_manufacturer_names(document)

    IO.inspect(manufacturer_names, label: "Manufacturer Names")

    # Form urls based on manufacturer name so we can find the vehicle models
    manufacturer_urls = collect_manufacturer_urls(current_url, manufacturer_names)
    IO.inspect(manufacturer_urls, label: "Manufacturer URLS")

    # extracted_items = manufacturer_names
    extracted_items = []

    next_requests = Enum.map(manufacturer_urls, &Crawly.Request.new(url: &1 |> to_string(), spider: ModelsToScrape))
    IO.inspect(next_requests, label: "Next Requests")

    # next_requests = Enum.map(manufacturer_urls, &Crawly.Request.new(url: &1, spider: ModelsToScrape))
    %Crawly.ParsedItem{items: extracted_items, requests: next_requests}
  end

  defp collect_manufacturer_names(document) do
    manufacturer_names =
      document
      |> Floki.find("div[class='group manufacturers'] a[href] h3")
      |> Enum.map(fn {_first, _second, last} -> last end)
      |> Enum.flat_map(& &1)
      |> Enum.map(&String.downcase/1)
  end

  defp collect_manufacturer_urls(url, manufacturer_names) do
    manufacturer_urls =
      manufacturer_names
      |> Enum.map(fn str -> "#{url}#{str}/" end)
  end
end

models_to_scrape.ex
-------------------

defmodule ModelsToScrape do
  use Crawly.Spider

  @impl Crawly.Spider
  def base_url(), do: "https://www.anchorvans.co.uk/"

  @impl Crawly.Spider
  def init() do
    [start_urls: ["https://www.anchorvans.co.uk/"]]
  end

  @impl Crawly.Spider
  @doc """
  Extract items and requests to follow from the given response
  """
  def parse_item(response) do
    {:ok, document} = Floki.parse_document(response.body)

    current_url = response.request_url
    IO.inspect(current_url, label: "ModelsToScrape Current URL")

    extracted_items = []

    next_requests = []
    %Crawly.ParsedItem{items: extracted_items, requests: next_requests}
  end
end

execution log
-------------

.
.
.
09:55:57.027 [error] Pipeline crash by call:
Crawly.Middlewares.UniqueRequest.run(%Crawly.Request{url: [url: "https://www.anchorvans.co.uk/specifications/volkswagen/", spider: ModelsToScrape], headers: [], prev_response: %HTTPoison.Response{status_code: 200, body: "\n\n\n\n\n <html lang="en" prefix="fb: http://www.facebook.com/2008/fbml\" class="no-js"> \n <head prefix="og: http://ogp.me/ns# object: http://ogp.me/ns/object#\">\n <meta charset="utf-8">\n <meta name="viewport" content="width=device-width, initial-scale=1, maximum-scale=1">\n <title>Specifications - Anchor Vans</title>\n <meta property="og:locale" content="en_GB" />\n <meta property="og:type" content="article" />\n <meta property="og:description" content="Place holder for list of makes - template for this page shows the top level of van_make_model taxonomy" />\n <meta name="description" content="Place holder for list of makes - template for this page shows the top level of van_make_model taxonomy">\n <meta property="og:image" content="//www.anchorvans.co.uk/img/logo.png">\n <meta property="og:image:type" content="image/png">\n <meta property="og:image:width" content="358">\n <meta property="og:image:height" content="148">\n <meta property="og:url" content="https://www.anchorvans.co.uk/specifications/\" />\n <meta property="og:site_name" content="Anchor Vans" />\n <meta property="og:title" content="Specifications" />\n\n \n <link rel="stylesheet" href="/css/jquery-ui-1.10.2.custom.min.css">\n <link rel="stylesheet" media="screen" href="/css/anchorvans.css">\n \n <script src="/js/modernizr-2.6.2.min.js"></script>\n <script src="/js/prefixfree.min.js" type="text/javascript"></script>\n <script src="/js/jquery-1.9.1.min.js"></script>\n <script src="/js/jquery-ui-1.10.1.custom.min.js"></script>\n <script src="/js/jquery.tablesorter.min.js"></script>\n <script src="/js/jquery.leanModal.min.js"></script>\n <script src="/js/jquery.waitforimages.min.js"></script>\n <script src="/js/jquery.cookie.js"></script>\n <script src="/js/anchor.min.js"></script>\n <script src="/js/teammembers.js"></script>\n \t\t<script>\n \t\t (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){\n \t\t (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1new Date();a=s.createElement(o),\n \t\t m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)\n \t\t })(window,document,'script','https://www.google-analytics.com/analytics.js','ga');\n\n \t\t ga('create', 'UA-438032-1', 'auto');\n \t\t ga('send', 'pageview');\n\n \t\t</script>\n\t\t <link rel="alternate" type="application/rss+xml" title="Used Vans RSS feed" href="/vans/rss.xml" />\n <link href="/favicon.ico" rel="Shortcut Icon" type="image/vnd.microsoft.icon" />\n \n <body class="specifications" itemscope itemtype="http://data-vocabulary.org/Organization\">\n <div class="widget-ui">\n <div style="display: none" id="cookie-message" class="cookie-message ui-widget-header">\n <p title="This site uses cookies. By continuing to use the site you give consent for us to set cookies in your browser">\n This site uses cookies. By continuing to use the site you give consent for us to set cookies in your browser.\n <a class="cookie-message-close" href="#cookie-message">Hide\n

\n \n \n <nav id="social">\n \t<ul class="social boundary">\n <li class="facebook"><a href="http://www.facebook.com/anchorvans\">\n " <> ..., headers: [{"Date", "Tue, 03 Oct 2023 08:55:55 GMT"}, {"Server", "Apache/2.4.29 (Ubuntu)"}, {"Link", "https://www.anchorvans.co.uk/blog/wp-json/; rel="https://api.w.org/""}, {"Link", "https://www.anchorvans.co.uk/blog/?p=3109; rel=shortlink"}, {"Access-Control-Allow-Origin", ""}, {"Access-Control-Allow-Headers", "Content-Type"}, {"Access-Control-Allow-Methods", "GET, POST, DELETE, PUT, OPTIONS, HEAD"}, {"X-Powered-By", ""}, {"Set-Cookie", "PHPSESSID=1mo0elfplj5k69fussqabaktu2; path=/"}, {"Expires", "Thu, 19 Nov 1981 08:52:00 GMT"}, {"Cache-Control", "no-store, no-cache, must-revalidate"}, {"Pragma", "no-cache"}, {"Vary", "Accept-Encoding"}, {"Transfer-Encoding", "chunked"}, {"Content-Type", "text/html; charset=UTF-8"}], request_url: "https://www.anchorvans.co.uk/specifications/", request: %HTTPoison.Request{method: :get, url: "https://www.anchorvans.co.uk/specifications/", headers: [{"User-Agent", "Crawly Bot"}], body: "", params: %{}, options: []}}, options: [], middlewares: [Crawly.Middlewares.DomainFilter, Crawly.Middlewares.UniqueRequest, {Crawly.Middlewares.UserAgent, [user_agents: ["Crawly Bot", "Google"]]}], retries: 0}, %{count: 17, requests: [%Crawly.Request{url: [url: "https://www.anchorvans.co.uk/specifications/vauxhall/", spider: ModelsToScrape], headers: [{"User-Agent", "Crawly Bot"}], prev_response: %HTTPoison.Response{status_code: 200, body: "\n\n\n\n\n <html lang="en" prefix="fb: http://www.facebook.com/2008/fbml" class="no-js"> \n <head prefix="og: http://ogp.me/ns# object: http://ogp.me/ns/object#">\n <meta charset="utf-8">\n <meta name="viewport" content="width=device-width, initial-scale=1, maximum-scale=1">\n <title>Specifications - Anchor Vans</title>\n <meta property="og:locale" content="en_GB" />\n <meta property="og:type" content="article" />\n <meta property="og:description" content="Place holder for list of makes - template for this page shows the top level of van_make_model taxonomy" />\n <meta name="description" content="Place holder for list of makes - template for this page shows the top level of van_make_model taxonomy">\n <meta property="og:image" content="//www.anchorvans.co.uk/img/logo.png">\n <meta property="og:image:type" content="image/png">\n <meta property="og:image:width" content="358">\n <meta property="og:image:height" content="148">\n <meta property="og:url" content="https://www.anchorvans.co.uk/specifications/" />\n <meta property="og:site_name" content="Anchor Vans" />\n <meta property="og:title" content="Specifications" />\n\n \n <link rel="stylesheet" href="/css/jquery-ui-1.10.2.custom.min.css">\n <link rel="stylesheet" media="screen" href="/css/anchorvans.css">\n \n <script src="/js/modernizr-2.6.2.min.js"> (truncated)

09:56:07.027 [error] GenServer #PID<0.294.0> terminating
** (ArgumentError) cannot convert the given list to a string.

To be converted to a string, a list must either be empty or only
contain the following elements:

  • strings
  • integers representing Unicode code points
  • a list containing one of these three elements

Please check the given list or call inspect/1 to get the list representation, got:

[url: "https://www.anchorvans.co.uk/specifications/volkswagen/", spider: ModelsToScrape]

(elixir 1.15.5) lib/list.ex:1084: List.to_string/1
(httpoison 1.8.2) lib/httpoison.ex:258: HTTPoison.request/1
(crawly 0.16.0) lib/crawly/worker.ex:89: Crawly.Worker.get_response/1
(crawly 0.16.0) lib/crawly/worker.ex:48: Crawly.Worker.handle_info/2
(stdlib 5.0.2) gen_server.erl:1077: :gen_server.try_handle_info/3
(stdlib 5.0.2) gen_server.erl:1165: :gen_server.handle_msg/6
(stdlib 5.0.2) proc_lib.erl:241: :proc_lib.init_p_do_apply/3

Last message: :work
State: %Crawly.Worker{backoff: 10000, spider_name: ManufacturersToScrape, crawl_id: "a95dbf36-61ca-11ee-a44d-ac87a32a845f"}
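
For reference, the ArgumentError is reproducible directly in iex, which I think confirms that the keyword list itself is what reaches
HTTPoison in the request's :url field:

iex> List.to_string([url: "https://www.anchorvans.co.uk/specifications/volkswagen/", spider: ModelsToScrape])
** (ArgumentError) cannot convert the given list to a string.
   ...
   (elixir 1.15.5) lib/list.ex:1084: List.to_string/1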

@oltarasenko oltarasenko added the bug Something isn't working label Apr 9, 2024