Purr #8

Monduiz · 2017-11-23T19:05:24Z

Is there a way to use map() in a pipe with lexrank? Let's say I want to extract a summary sentence from each document in a data frame, with one article per row.

I guess you would have to unnest_sentences for each row, then create a new table to store the top ranking sentences?

AdamSpannbauer · 2017-11-24T18:44:27Z

Restating the problem to check my understanding:
You want to input a data.frame of documents and get the top lexRanked sentence per document.

Assuming you want to stick to the tidyverse, here is a possible solution using a custom function & purrr::map(). The code below is taken from a gist created to address this issue; the gist also includes the code used to create the example input data.frame (my_df) & to show the output.

#tidyverse solution to get top lexrank sentence per doc
library(dplyr)
library(purrr)
#convert to tibble
my_tbl = as_tibble(my_df)

#function to get top lexranked sentence in a df
get_top_sentences = function(df_in, text_col = "text", n=1) {
  #perform piped lexrank process and extract top ranked sentence
  lex_df = lexRankr::unnest_sentences_(df_in, "sentences", text_col) %>% #parse sentences
    lexRankr::bind_lexrank(sentences, sent_id, level = "sentences") %>% #perform lexrank
    arrange(desc(lexrank)) %>% #get top ranked sentence(s)
    slice(1:n)
  return(lex_df)
}

#get top sentence(s) per document
#split into a list with document dfs as elements
top_sent_df = split(my_tbl, my_tbl$doc_id) %>% 
  #apply lexrank function to extract top n ranked sentences
  map(get_top_sentences, n=1) %>% 
  #recombine into single df
  bind_rows()
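
For anyone unsure what the split() / map() / bind_rows() chain is doing, here is a minimal sketch of the same pattern on a toy data frame. The names toy_df and first_n_rows are hypothetical; first_n_rows only stands in for get_top_sentences so that the example does not depend on lexRankr.

```r
library(dplyr)
library(purrr)

# hypothetical toy data: two documents with a few scored rows each
toy_df = tibble(
  doc_id = c(1, 1, 2, 2, 2),
  score  = c(0.2, 0.9, 0.5, 0.1, 0.7)
)

# hypothetical stand-in for get_top_sentences(): keep the n highest-scoring rows
first_n_rows = function(df_in, n = 1) {
  df_in %>%
    arrange(desc(score)) %>%
    head(n)
}

# split() returns a named list with one data frame per doc_id
doc_list = split(toy_df, toy_df$doc_id)

# map() applies the function to each list element;
# bind_rows() stacks the per-document results back into one data frame
top_rows = doc_list %>%
  map(first_n_rows, n = 1) %>%
  bind_rows()
```

Because the function only ever sees one document's rows at a time, any score it computes is local to that document.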

Monduiz · 2017-11-25T16:17:37Z

Thank you, this shows me an approach that I think works like yours. Here is how I have adapted it. I just want to make sure that I am getting the results I should be getting and that the ranking is what it should be. This is how I see a tidyverse workflow guided by your example; hopefully the results are valid:

library(rvest)
library(tidyverse)
library(stringr)
library(purrr)
library(lexRankr)

gm_headlines <- read_html("https://beta.theglobeandmail.com/politics/")

gm_links <- gm_headlines %>%
  html_nodes(".o-card__link") %>%
  html_attr("href") %>% 
  xml2::url_absolute("https://beta.theglobeandmail.com")

pages <- gm_links %>% map(read_html)

gm_articles <- pages %>% 
  map(. %>% 
        html_nodes(".c-article-body__text") %>% 
        html_text()
  )

gm_titles <- gm_headlines %>%
  html_nodes('.o-card__content-text') %>%
  html_text()

gm <- tibble(gm_titles, gm_links, gm_articles)

# Remove duplicates and video links
gm <- gm %>% 
  distinct(gm_titles, .keep_all = TRUE) %>% 
  filter(!str_detect(gm_links, 'video')) %>%
  mutate(doc_id = 1:length(gm_articles))


### summarization
gm_unnest <- gm %>% 
  select(doc_id, gm_articles) %>% 
  unnest(gm_articles)

gm_rank <- gm_unnest %>% unnest_sentences("sentences", gm_articles) %>% 
  bind_lexrank(sentences, sent_id, level = "sentences")

gm_rank <- gm_rank %>% group_by(doc_id) %>% 
  arrange(desc(lexrank)) %>% 
  arrange(doc_id)

gm_rank <- gm_rank %>% 
  select(doc_id, sentences, lexrank) %>% 
  group_by(doc_id) %>% 
  top_n(2, lexrank)

AdamSpannbauer · 2017-11-25T19:43:57Z

It depends on what you are trying to accomplish. Your code will give you the highest lexRanked sentences per doc, but the lexRank itself is being calculated with respect to the whole corpus. So the sentences returned for each document are the ones most representative of the corpus as a whole, not necessarily the ones most representative of that document. If that is what you are trying to accomplish then your solution works.

The code I provided calculates lexRank within each document and returns the highest lexRanked sentences per document. If this is your goal you will also benefit from a performance boost (since executing lexRank on a full corpus can be computationally expensive). Below is an extension of the gist I posted using purrr::possibly() to add some easy tryCatch logic.

safe_top_sent = purrr::possibly(get_top_sentences, otherwise = NULL, quiet = FALSE)
#get top sentence(s) per document
#split into a list with document dfs as elements
gm_rank_doc_level = split(gm_unnest, gm_unnest$doc_id) %>% 
  #apply lexrank function to extract top n ranked sentences
  map(safe_top_sent, text_col="gm_articles", n=2) %>% 
  #recombine into single df
  bind_rows()
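
As a rough illustration of what purrr::possibly() contributes here, the sketch below uses a hypothetical function (risky_summary) that errors for one document, standing in for a lexRank call that fails. The toy data and function names are assumptions for illustration only.

```r
library(dplyr)
library(purrr)

# hypothetical function that fails on some inputs, standing in for a
# lexRank call that errors on a particular document
risky_summary = function(df_in) {
  if (nrow(df_in) < 2) stop("not enough rows to rank")
  summarise(df_in, doc_id = first(doc_id), n_rows = n())
}

# wrap it so an error returns NULL instead of stopping the whole map()
safe_summary = possibly(risky_summary, otherwise = NULL, quiet = TRUE)

# hypothetical toy data: doc 2 has a single row and will trigger the error
toy_df = tibble(doc_id = c(1, 1, 2, 3, 3, 3))

results = split(toy_df, toy_df$doc_id) %>%
  map(safe_summary)

# the failing document comes back as NULL ...
failed = map_lgl(results, is.null)

# ... and bind_rows() skips NULL elements, so doc 2 is absent from the output
combined = bind_rows(results)
```

With quiet = FALSE, as in the code above the sketch, the error message is still printed, which makes it easier to see which documents were dropped.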

Additionally, in your code you call unnest_sentences("sentences", gm_articles). If using unnest_sentences you don't need to quote your column names (it works fine if you do, you just don't have to). So your call could be: unnest_sentences(sentences, gm_articles). In the gist I used unnest_sentences_ which requires character column names instead of unquoted names; I did this to easily parameterize one of the inputs in the custom function.
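
A small sketch of the two calling styles, assuming the lexRankr package is installed. The data frame toy_docs and the wrapper split_sentences are hypothetical names used only for illustration.

```r
library(lexRankr)

# hypothetical one-row input with a character text column
toy_docs = data.frame(
  title = "Example",
  text  = "First sentence here. Second sentence here.",
  stringsAsFactors = FALSE
)

# unquoted column names with unnest_sentences()
unquoted = unnest_sentences(toy_docs, sentences, text)

# character column names with unnest_sentences_()
quoted = unnest_sentences_(toy_docs, "sentences", "text")

# the character version makes it easy to pass the column name as an argument
split_sentences = function(df_in, text_col = "text") {
  unnest_sentences_(df_in, "sentences", text_col)
}
wrapped = split_sentences(toy_docs, text_col = "text")
```

Each call should return one row per parsed sentence, with the sentence text in the new sentences column.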

Monduiz · 2017-11-25T19:48:35Z

I am trying to get the highest ranked sentences per document. I will go back to your code. I wasn't able to make it work at first and the split() line was confusing me.

AdamSpannbauer · 2017-11-25T19:49:18Z

I'll look into finding the best way to add this functionality to the package so the process is simpler.

Monduiz · 2017-11-25T19:50:25Z

I am still trying to adapt the get_top_sentences function as it has not been working for me. Thank you so much for your insights!

AdamSpannbauer · 2017-11-25T19:58:20Z

Your full script runs for me if I make the safe_top_sent modification in my 2nd comment. There were 2 articles that ran into errors in the lexRank process.

If the fix mentioned doesn't resolve your issue please post your error and we can work through it.

Monduiz · 2017-11-25T20:06:28Z

Ok, I do get an error but I think I am not using your modification correctly.

Error: All values in sentence term tfidf matrix are 0.  Similarities would return as NaN

So, I am assuming the start is after gm_unnest?

gm_unnest <- gm %>% 
  select(doc_id, gm_articles) %>% 
  unnest(gm_articles)

From there, is the function get_top_sentences also modified in some way? Then you run your modification?

AdamSpannbauer · 2017-11-25T20:40:03Z

I combined your code and a possible solution using the get_top_sentences() function into this gist (new code starting at line 48).

The script runs into 2 errors during the lexRank process (so 2 documents will be missing from our results), but the tryCatch-type logic allows the script to continue processing the remaining documents.

Monduiz · 2017-11-25T20:45:51Z

This is working nicely! Thank you, this is great! If you do look into simplifying this process, I would love to test it!

AdamSpannbauer closed this as completed Nov 25, 2017

AdamSpannbauer mentioned this issue Nov 26, 2017

add helper for multiple doc lexranking (within doc) #9

Open

AdamSpannbauer added the question label Dec 11, 2017

cspenn mentioned this issue Sep 23, 2018

Not able to get single top sentence #17

Closed
