How to send some value to web page and get the output saved or retrieved as dataframe #103

raulfernandezn · 2020-08-19T23:08:59Z

Hi all. I am using the crrri package to obtain some data from the web. This package works better for me than other web scraping packages. My question is whether it is possible to send a value to a page in order to obtain some information. The web page is the following:

https://orcid.org/

It looks like this:

In the upper right corner there is a search box where you can enter a code to obtain some information:

I would like to know whether it is possible to insert a value in the search box, run the search, get the result and save it as a dataframe in my environment. For example, entering the code 0000-0001-6300-9350 on the page gives the following result:

I would like to know whether the crrri package can obtain that result and save it to a dataframe. I know the steps are to insert the value and then click on the search option, but I am not able to replicate those steps with code.

This is what I have tried so far; any pointers to the functions needed to obtain that information would be fantastic.

library(crrri)
chrome <- Chrome$new(bin = find_chrome_binary())
client <- chrome$connect(callback = function(client) {
  client$inspect()
})
Page <- client$Page

Page$navigate(url = "https://orcid.org/", 
              callback = function(result) {
                cat("The R session has received this result from Chrome!\n")
                print(result)
              })

async_result <- Page$navigate(url = "https://orcid.org/")
async_result %...>% print()


cderv · 2020-08-20T07:38:38Z

Thanks for using crrri! I noticed that when you type your ID in the search bar, the site builds a new URL of the form
https://orcid.org/orcid-search/search?searchQuery=0000-0001-6300-9350

That means you can replicate the search by building this kind of URL yourself and filling the ID into the query (the paste() function, glue 📦 or urltools 📦 will help with that).

Then you'll land on the desired page, where you can retrieve the table, I guess.
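
As a minimal sketch of the URL approach described above, the search URL can be assembled in base R. The helper name orcid_search_url() is made up for illustration and is not part of crrri; the URL pattern is the one quoted above, and utils::URLencode() is used so that queries containing spaces or other special characters stay valid.

```r
# Hypothetical helper (not part of crrri): build the ORCID search URL
# for a given query, following the pattern observed in the address bar.
orcid_search_url <- function(query) {
  paste0(
    "https://orcid.org/orcid-search/search?searchQuery=",
    utils::URLencode(query, reserved = TRUE)  # percent-encode special characters
  )
}

orcid_search_url("0000-0001-6300-9350")
```

The resulting string can then be passed to Page$navigate(url = ...) instead of the home page URL.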

Does it make sense? Have you already tried that?

raulfernandezn · 2020-08-20T10:29:14Z

Dear Dr. Christophe Dervieux, thanks for your message. What you mention is correct. The easy way is to take that URL and use a package to extract the data. But I am working with several different web pages, and most of them don't offer that option.

That is why I would like to do the task with crrri, at least up to the point where I send the string to the page and trigger the search.

Many thanks for your help, I have been trying to do that extraction with your package but my knowledge is limited.

Kind regards!

cderv · 2020-08-20T11:10:42Z

I don't know how to do that. It requires some searching. It seems you can do that with puppeteer, but I don't know how they do it.

@RLesur is there a JS way to do that ?

RLesur · 2020-08-21T12:56:42Z

I would do that as follows:

1. Set the focus on the input element. We need to run JavaScript in the browser to do that, using the Runtime.evaluate method.
2. Insert text in the input field with the Input.insertText method.
3. "Press" the Enter key. We need to simulate both the keydown and the keyup events using the Input.dispatchKeyEvent method.

For instance,

# Helpers -----------------------------------------------------------------
# Set the focus on an element
set_focus <- function(client, selector) {
  client$Runtime$evaluate(
    sprintf("document.querySelector('%s').focus()", selector)
  )
}

# Press the 'Enter' key
press_enter_key <- function(client) {
  dispatch_enter_key_event <- function(client, type) {
    client$Input$dispatchKeyEvent(
      type = type, 
      windowsVirtualKeyCode = 13, 
      code = "Enter", 
      key = "Enter", 
      text = "\r", 
      unmodifiedText = "\r"
    )
  }
  
  promises::then(
    dispatch_enter_key_event(client, "keyDown"),
    ~ dispatch_enter_key_event(client, "keyUp")
  )
}

# Main script -------------------------------------------------------------
library(crrri)

chrome <- Chrome$new(bin = find_chrome_binary())
client <- chrome$connect(callback = function(client) {
  client$inspect()
})

Page <- client$Page
Input <- client$Input

Page$enable() %...>% {
  Page$navigate(url = "https://orcid.org/")
  Page$loadEventFired() # await the load event
} %...>% {
  # the selector may be different for another website
  # you may need to modify it
  set_focus(client, selector = "input")
} %...>% {
  Input$insertText("0000-0002-0721-5595") # type an ORCID
} %...>% {
  press_enter_key(client)
}
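
Once the search has run, the result still has to be brought back into R, which is what the original question asks for. The following is only a sketch under several assumptions: the results are already rendered when it runs, they sit in an HTML table matched by the given selector (this was not checked for any of the pages discussed here), and the helper names table_tsv_expression(), tsv_to_data_frame() and fetch_table() are made up for illustration. It reuses the Runtime.evaluate method to serialise the table rows as tab-separated text in the browser, then parses that text with base R.

```r
# Hypothetical helpers, not part of crrri. They can be defined without any
# library() call; crrri and promises are only needed when fetch_table() runs.

# Build the JavaScript that serialises the rows of a table as TSV text
table_tsv_expression <- function(selector = "table") {
  sprintf(
    paste0(
      "Array.from(document.querySelectorAll('%s tr'))",
      ".map(tr => Array.from(tr.querySelectorAll('th,td'))",
      ".map(cell => cell.innerText.replace(/\\s+/g, ' ').trim()).join('\\t'))",
      ".join('\\n')"
    ),
    selector
  )
}

# Turn the TSV text into a data frame (the first row gives the column names)
tsv_to_data_frame <- function(tsv) {
  if (!nzchar(tsv)) return(data.frame())  # no table rows were found
  utils::read.delim(
    text = tsv, header = TRUE, quote = "",
    check.names = FALSE, colClasses = "character"
  )
}

# Evaluate the expression in the page; returns a promise of a data frame
fetch_table <- function(client, selector = "table") {
  promises::then(
    client$Runtime$evaluate(
      expression = table_tsv_expression(selector),
      returnByValue = TRUE
    ),
    function(result) tsv_to_data_frame(result$result$value)
  )
}

# Possible usage, once the results are visible in the browser:
# fetch_table(client) %...>% (function(df) assign("results", df, envir = globalenv()))
```

Because crrri is asynchronous, the data frame is only available inside the promise chain, hence the assign() call in the usage comment.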

cderv · 2020-08-21T13:17:26Z

So it seems you can do that in JS.
With puppeteer, it seems very easy!

page.type => https://devdocs.io/puppeteer/index#pagetypeselector-text-options
page.click => https://devdocs.io/puppeteer/index#pageclickselector-options
snippet: https://www.codota.com/code/javascript/functions/puppeteer/Page/type

Not easy to reproduce here I think :(

RLesur · 2020-08-21T14:19:31Z

@cderv puppeteer has dozens of great high level functions!

page.type, page.press,... are defined here https://github.com/puppeteer/puppeteer/blob/main/src/common/Input.ts.
For the press_enter_key() function, I've stolen the parameters here.

The page.click method is definitely great and wouldn't be so easy to reproduce (but this is feasible).

raulfernandezn · 2020-08-22T23:48:22Z

@RLesur Dear Dr. Lesur. Many thanks for your time on this issue. It is amazing to see how the final output is reached. As I mentioned, I am trying different pages. When I apply the code to the following page: https://dependenciasectorpublico.trabajo.gob.ec/DependenciaLaboralSectorPublico/

I am not able to get any results. Maybe, due to my limited knowledge of web pages, I am not defining the right settings. I use the following code. It is the same as yours:

# Helpers -----------------------------------------------------------------
# Set the focus on an element
set_focus <- function(client, selector) {
  client$Runtime$evaluate(
    sprintf("document.querySelector('%s').focus()", selector)
  )
}

# Press the 'Enter' key
press_enter_key <- function(client) {
  dispatch_enter_key_event <- function(client, type) {
    client$Input$dispatchKeyEvent(
      type = type, 
      windowsVirtualKeyCode = 13, 
      code = "Enter", 
      key = "Enter", 
      text = "\r", 
      unmodifiedText = "\r"
    )
  }
  
  promises::then(
    dispatch_enter_key_event(client, "keyDown"),
    ~ dispatch_enter_key_event(client, "keyUp")
  )
}

# Main script -------------------------------------------------------------
library(crrri)

chrome <- Chrome$new(bin = find_chrome_binary())
client <- chrome$connect(callback = function(client) {
  client$inspect()
})

Page <- client$Page
Input <- client$Input

Page$enable() %...>% {
  Page$navigate(url = "https://dependenciasectorpublico.trabajo.gob.ec/DependenciaLaboralSectorPublico/")
  Page$loadEventFired() # await the load event
} %...>% {
  # the selector may be different for another website
  # you may need to modify it
  set_focus(client, selector = "input")
} %...>% {
  Input$insertText("0952330223") # type the string
} %...>% {
  press_enter_key(client)
}

I have inspected the web page, and the element is an input, as seen below:

So set_focus() should work but it doesn't.

The output I get is the following:

It is empty. I have tried changing the selector, but it still doesn't work.

I would be grateful for any help in solving this issue. The code works perfectly for the first page, but it does not work for this second page.

RLesur · 2020-08-23T07:58:15Z

You get this issue because this page has several input elements. The JavaScript method document.querySelector() returns the first matched element. In this case, the first input element is not the input field that you want.

As you can see, the NÚMERO DE DOCUMENTO input element has an id. You can pass this id to the document.getElementById() JavaScript method.

# Helpers -----------------------------------------------------------------
# Set the focus on an element
set_focus_on_id <- function(client, id) {
  client$Runtime$evaluate(
    sprintf("document.getElementById('%s').focus()", id)
  )
}

# Press the 'Enter' key
press_enter_key <- function(client) {
  dispatch_enter_key_event <- function(client, type) {
    client$Input$dispatchKeyEvent(
      type = type, 
      windowsVirtualKeyCode = 13, 
      code = "Enter", 
      key = "Enter", 
      text = "\r", 
      unmodifiedText = "\r"
    )
  }
  
  promises::then(
    dispatch_enter_key_event(client, "keyDown"),
    ~ dispatch_enter_key_event(client, "keyUp")
  )
}

# Main script -------------------------------------------------------------
library(crrri)

chrome <- Chrome$new(bin = find_chrome_binary())
client <- chrome$connect(callback = function(client) {
  client$inspect()
})

Page <- client$Page
Input <- client$Input

Page$enable() %...>% {
  Page$navigate(url = "https://dependenciasectorpublico.trabajo.gob.ec/DependenciaLaboralSectorPublico/")
  Page$loadEventFired() # await the load event
} %...>% {
  # the id may be different for another website
  # you may need to modify it
  set_focus_on_id(client, id = "frmTodo:txtCedula")
} %...>% {
  Input$insertText("0952330223") # type the string
} %...>% {
  press_enter_key(client)
}
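
The same field can also be reached through the original set_focus() helper by passing a CSS attribute selector. A plain "#frmTodo:txtCedula" selector would not work with document.querySelector(), because the colon would be read as the start of a pseudo-class; the attribute form below avoids any escaping. The names id_selector() and focus_expression() are made up for this sketch, and the expression builder simply repeats the sprintf() call used in set_focus().

```r
# Hypothetical helper: CSS attribute selector matching an element by its id.
# Unlike "#frmTodo:txtCedula", this form needs no escaping of the colon.
id_selector <- function(id) {
  sprintf('[id="%s"]', id)
}

# Same expression as the one built inside set_focus(), split out so that
# the generated JavaScript can be inspected before sending it to Chrome
focus_expression <- function(selector) {
  sprintf("document.querySelector('%s').focus()", selector)
}

cat(focus_expression(id_selector("frmTodo:txtCedula")), "\n")
```

With this, set_focus(client, selector = id_selector("frmTodo:txtCedula")) should target the same element as set_focus_on_id(), assuming the id contains no quote characters.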

raulfernandezn · 2020-08-24T17:24:56Z

Dear Dr. Lesur @RLesur, infinite thanks for your help. It is now easier for me to work with the page structure and obtain the result. I am curious whether a similar method can be applied to pages with a captcha. Some of the pages I have to work with have one. If it is not possible, I will understand and will have to skip those pages. On the following page I used the code you kindly helped me with:

https://www.senescyt.gob.ec/web/guest/consultas

It has a field that can be filled in using the same structure you shared, but it also includes a captcha like this:

Here is the code I used:

# Helpers -----------------------------------------------------------------
# Set the focus on an element
set_focus_on_id <- function(client, id) {
  client$Runtime$evaluate(
    sprintf("document.getElementById('%s').focus()", id)
  )
}

# Press the 'Enter' key
press_enter_key <- function(client) {
  dispatch_enter_key_event <- function(client, type) {
    client$Input$dispatchKeyEvent(
      type = type, 
      windowsVirtualKeyCode = 13, 
      code = "Enter", 
      key = "Enter", 
      text = "\r", 
      unmodifiedText = "\r"
    )
  }
  
  promises::then(
    dispatch_enter_key_event(client, "keyDown"),
    ~ dispatch_enter_key_event(client, "keyUp")
  )
}

# Main script -------------------------------------------------------------
library(crrri)

chrome <- Chrome$new(bin = find_chrome_binary())
client <- chrome$connect(callback = function(client) {
  client$inspect()
})

Page <- client$Page
Input <- client$Input

Page$enable() %...>% {
  Page$navigate(url = "https://www.senescyt.gob.ec/web/guest/consultas")
  Page$loadEventFired() # await the load event
} %...>% {
  # the id may be different for another website
  # you may need to modify it
  set_focus_on_id(client, id = "formPrincipal:identificacion")
} %...>% {
  #1720245768
  Input$insertText("0952330223") # type the string
} %...>% {
  press_enter_key(client)
}

The same scheme works: the value for the first field can be inserted using the previous functions. With that code I can get as far as this stage:

After exploring the structure of this page, I noticed this:

The captcha is generated and saved as a .jpg file, whose address appears in the src attribute of the element with id formPrincipal:capimg. I was wondering whether there is a way to obtain the image referenced by that src attribute and then extract the text.

The text could be extracted from the image with functions from the tesseract package, like this: eng <- tesseract("eng") and text <- tesseract::ocr("thepathfromsrc", engine = eng). This would give a new string that I could then pass with Input$insertText to another id.

What would be an approach to obtain the captcha image and then feed the extracted text to the page? Many thanks for your help.

RLesur · 2020-08-25T15:19:12Z

Sorry, I don't have enough time to dig deeper into the captcha case.
In order to do that, I'd try to use the Network domain to intercept the images. You can find an example with the chrome-remote-interface Node module here: https://stackoverflow.com/a/46079635.

raulfernandezn · 2020-08-26T00:14:49Z

Dear Dr. Lesur @RLesur, many thanks. With that info I will research the topic. I have another question: is it possible to turn the press_enter_key function into a function that clicks on a search button in a web page? Could you please recommend which crrri functions I can use to build a function of this kind?

RLesur · 2020-08-26T22:06:03Z

I've never tried to click on an element but I would adopt the following strategy:

1. DOM.getDocument to obtain the root nodeId.
2. DOM.querySelector to obtain the nodeId of the selected element (the root nodeId is required here).
3. DOM.getBoxModel to obtain the coordinates of the element bounding box. Then, compute the element centroid.
4. Input.dispatchMouseEvent twice (the first one with the mousePressed event, the second one with the mouseReleased event). The element centroid is used here.

I hope this will help you.
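
The strategy above could be sketched as follows. This has not been tested against a live page, and the names quad_centroid() and click_element() are made up for illustration. It assumes that the element is already inside the viewport, that the selector matches something, and that the content quad returned by DOM.getBoxModel arrives in R as eight numbers (x1, y1, ..., x4, y4).

```r
# Hypothetical helpers, not part of crrri. crrri and promises are only
# needed when click_element() is actually called.

# Centroid of a quad given as x1, y1, x2, y2, x3, y3, x4, y4
quad_centroid <- function(quad) {
  quad <- as.numeric(unlist(quad))
  stopifnot(length(quad) == 8)
  c(x = mean(quad[c(1, 3, 5, 7)]), y = mean(quad[c(2, 4, 6, 8)]))
}

# Click on the centre of the first element matching a CSS selector;
# returns a promise that resolves after the mouse button is released
click_element <- function(client, selector) {
  DOM <- client$DOM
  Input <- client$Input

  send_mouse_event <- function(type, centre) {
    Input$dispatchMouseEvent(
      type = type, x = centre[["x"]], y = centre[["y"]],
      button = "left", clickCount = 1
    )
  }

  p <- DOM$getDocument()                                   # root nodeId
  p <- promises::then(p, function(doc) {
    DOM$querySelector(nodeId = doc$root$nodeId, selector = selector)
  })
  p <- promises::then(p, function(node) {
    DOM$getBoxModel(nodeId = node$nodeId)                  # bounding box
  })
  promises::then(p, function(box) {
    centre <- quad_centroid(box$model$content)
    promises::then(
      send_mouse_event("mousePressed", centre),
      function(value) send_mouse_event("mouseReleased", centre)
    )
  })
}

# Possible usage (the selector is an assumption, adapt it to the page):
# click_element(client, "button[type='submit']")
```

If the element sits outside the visible area, it would first need to be scrolled into view, since the mouse event coordinates are relative to the viewport.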
