
Commit 3e209c2

fix typos in docs
AdamSpannbauer committed Dec 4, 2018
1 parent 7bce0a1 commit 3e209c2
Showing 11 changed files with 24 additions and 24 deletions.
4 changes: 2 additions & 2 deletions R/lexRank.R
@@ -5,12 +5,12 @@
#' @param docId A vector of document IDs with length equal to the length of \code{text}. If \code{docId == "create"} then doc IDs will be created as an index from 1 to \code{n}, where \code{n} is the length of \code{text}.
#' @param threshold The minimum simil value a sentence pair must have to be represented in the graph where lexRank is calculated.
#' @param n The number of sentences to return as the extractive summary. The function will return the top \code{n} lexRanked sentences. See \code{returnTies} for handling ties in lexRank.
-#' @param returnTies \code{TRUE} or \code{FALSE} indicating whether or not to return greater than \code{n} sentence IDs if there is a tie in lexRank. If \code{TRUE}, the returned number of sentences will not be limited to \code{n}, but rather will return every sentece with a top 3 score. If \code{FALSE}, the returned number of sentences will be \code{<=n}. Defaults to \code{TRUE}.
+#' @param returnTies \code{TRUE} or \code{FALSE} indicating whether or not to return greater than \code{n} sentence IDs if there is a tie in lexRank. If \code{TRUE}, the returned number of sentences will not be limited to \code{n}, but rather will return every sentence with a top 3 score. If \code{FALSE}, the returned number of sentences will be \code{<=n}. Defaults to \code{TRUE}.
#' @param usePageRank \code{TRUE} or \code{FALSE} indicating whether or not to use the page rank algorithm for ranking sentences. If \code{FALSE}, a sentence's unweighted centrality will be used as the rank. Defaults to \code{TRUE}.
#' @param damping The damping factor to be passed to the page rank algorithm. Ignored if \code{usePageRank} is \code{FALSE}.
#' @param continuous \code{TRUE} or \code{FALSE} indicating whether or not to use continuous LexRank. Only applies if \code{usePageRank==TRUE}. If \code{TRUE}, \code{threshold} will be ignored and lexRank will be computed using a weighted graph representation of the sentences. Defaults to \code{FALSE}.
#' @param sentencesAsDocs \code{TRUE} or \code{FALSE}, indicating whether or not to treat sentences as documents when calculating tfidf scores for similarity. If \code{TRUE}, inverse document frequency will be calculated as inverse sentence frequency (useful for single document extractive summarization).
-#' @param removePunc \code{TRUE} or \code{FALSE} indicating whether or not to remove punctuation from text while tokenizing. If \code{TRUE}, puncuation will be removed. Defaults to \code{TRUE}.
+#' @param removePunc \code{TRUE} or \code{FALSE} indicating whether or not to remove punctuation from text while tokenizing. If \code{TRUE}, punctuation will be removed. Defaults to \code{TRUE}.
#' @param removeNum \code{TRUE} or \code{FALSE} indicating whether or not to remove numbers from text while tokenizing. If \code{TRUE}, numbers will be removed. Defaults to \code{TRUE}.
#' @param toLower \code{TRUE} or \code{FALSE} indicating whether or not to coerce all of text to lowercase while tokenizing. If \code{TRUE}, \code{text} will be coerced to lowercase. Defaults to \code{TRUE}.
#' @param stemWords \code{TRUE} or \code{FALSE} indicating whether or not to stem resulting tokens. If \code{TRUE}, the resulting tokens will be stemmed using \code{SnowballC::wordStem()}. Defaults to \code{TRUE}.
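For reference, a minimal sketch built only from the arguments documented above (the full signature is truncated in this hunk, so anything beyond these names and defaults is an assumption):

```r
library(lexRankr)

docs <- c("The system crashed at noon. A restart fixed the system.",
          "Noon crashes were traced to a memory leak in the system.")

# Rank sentences across the two documents; with returnTies = TRUE,
# more than n rows can come back when scores tie.
top_sentences <- lexRank(text = docs,
                         docId = "create",   # doc IDs created as 1..length(text)
                         n = 2,
                         returnTies = TRUE,
                         continuous = TRUE)  # weighted graph; threshold ignored
```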
8 changes: 4 additions & 4 deletions R/lexRankFromSimil.R
@@ -1,12 +1,12 @@
#' Compute LexRanks from pairwise sentence similarities

#' @description Compute LexRanks from sentence pair similarities using the page rank algorithm or degree centrality. The methods used to compute lexRank are discussed in "LexRank: Graph-based Lexical Centrality as Salience in Text Summarization."
-#' @param s1 A character vector of sentence IDs corresponding to the \code{s2} and \code{simil} arguemants.
-#' @param s2 A character vector of sentence IDs corresponding to the \code{s1} and \code{simil} arguemants.
-#' @param simil A numeric vector of similiarity values that represents the similiarity between the sentences represented by the IDs in \code{s1} and \code{s2}.
+#' @param s1 A character vector of sentence IDs corresponding to the \code{s2} and \code{simil} arguments
+#' @param s2 A character vector of sentence IDs corresponding to the \code{s1} and \code{simil} arguments
+#' @param simil A numeric vector of similarity values that represents the similarity between the sentences represented by the IDs in \code{s1} and \code{s2}.
#' @param threshold The minimum simil value a sentence pair must have to be represented in the graph where lexRank is calculated.
#' @param n The number of sentences to return as the extractive summary. The function will return the top \code{n} lexRanked sentences. See \code{returnTies} for handling ties in lexRank.
-#' @param returnTies \code{TRUE} or \code{FALSE} indicating whether or not to return greater than \code{n} sentence IDs if there is a tie in lexRank. If \code{TRUE}, the returned number of sentences will not be limited to \code{n}, but rather will return every sentece with a top 3 score. If \code{FALSE}, the returned number of sentences will be \code{<=n}. Defaults to \code{TRUE}.
+#' @param returnTies \code{TRUE} or \code{FALSE} indicating whether or not to return greater than \code{n} sentence IDs if there is a tie in lexRank. If \code{TRUE}, the returned number of sentences will not be limited to \code{n}, but rather will return every sentence with a top 3 score. If \code{FALSE}, the returned number of sentences will be \code{<=n}. Defaults to \code{TRUE}.
#' @param usePageRank \code{TRUE} or \code{FALSE} indicating whether or not to use the page rank algorithm for ranking sentences. If \code{FALSE}, a sentence's unweighted centrality will be used as the rank. Defaults to \code{TRUE}.
#' @param damping The damping factor to be passed to the page rank algorithm. Ignored if \code{usePageRank} is \code{FALSE}.
#' @param continuous \code{TRUE} or \code{FALSE} indicating whether or not to use continuous LexRank. Only applies if \code{usePageRank==TRUE}. If \code{TRUE}, \code{threshold} will be ignored and lexRank will be computed using a weighted graph representation of the sentences. Defaults to \code{FALSE}.
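The three parallel vectors documented above feed a single call; a sketch with invented similarity values for illustration:

```r
library(lexRankr)

# Pairwise similarities for three sentences (values are illustrative)
s1    <- c("d1_1", "d1_1", "d1_2")
s2    <- c("d1_2", "d1_3", "d1_3")
simil <- c(0.30,   0.05,   0.45)

# Pairs with simil below threshold are dropped from the graph
# before page rank is applied.
ranks <- lexRankFromSimil(s1, s2, simil,
                          threshold = 0.2,
                          n = 2,
                          usePageRank = TRUE,
                          damping = 0.85)
```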
6 changes: 3 additions & 3 deletions R/sentenceSimil.R
@@ -5,9 +5,9 @@ NULL
#' Compute distance between sentences

#' @description Compute distance between sentences using modified idf cosine distance from "LexRank: Graph-based Lexical Centrality as Salience in Text Summarization". Output can be used as input to \code{\link{lexRankFromSimil}}.
-#' @param sentenceId A character vector of sentence IDs corresponding to the \code{docId} and \code{token} arguemants.
-#' @param token A character vector of tokens corresponding to the \code{docId} and \code{sentenceId} arguemants.
-#' @param docId A character vector of document IDs corresponding to the \code{sentenceId} and \code{token} arguemants. Can be \code{NULL} if \code{sentencesAsDocs} is \code{TRUE}.
+#' @param sentenceId A character vector of sentence IDs corresponding to the \code{docId} and \code{token} arguments
+#' @param token A character vector of tokens corresponding to the \code{docId} and \code{sentenceId} arguments
+#' @param docId A character vector of document IDs corresponding to the \code{sentenceId} and \code{token} arguments. Can be \code{NULL} if \code{sentencesAsDocs} is \code{TRUE}.
#' @param sentencesAsDocs \code{TRUE} or \code{FALSE}, indicating whether or not to treat sentences as documents when calculating tfidf scores. If \code{TRUE}, inverse document frequency will be calculated as inverse sentence frequency (useful for single document extractive summarization).
#' @return A 3 column dataframe of pairwise distances between sentences. Columns: \code{sent1} (sentence id), \code{sent2} (sentence id), & \code{dist} (distance between \code{sent1} and \code{sent2}).
#' @references \url{http://www.cs.cmu.edu/afs/cs/project/jair/pub/volume22/erkan04a-html/erkan04a.html}
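A sketch of the parallel-vector interface described above, with illustrative tokens (three one-sentence documents, so "system" carries cross-document weight):

```r
library(lexRankr)

# One element per token; sentence and document IDs repeat alongside
docId      <- c("d1", "d1", "d2", "d2", "d3", "d3")
sentenceId <- c("d1_1", "d1_1", "d2_1", "d2_1", "d3_1", "d3_1")
token      <- c("system", "crash", "system", "restart", "memory", "leak")

similDf <- sentenceSimil(sentenceId = sentenceId,
                         token      = token,
                         docId      = docId)
# the pairwise output feeds directly into lexRankFromSimil()
```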
2 changes: 1 addition & 1 deletion R/sentenceTokenParse.R
@@ -3,7 +3,7 @@
#' @description Parse a character vector of documents into both sentences and a clean vector of tokens. The resulting output includes IDs for document and sentence for use in other \code{lexRank} functions.
#' @param text A character vector of documents to be parsed into sentences and tokenized.
#' @param docId A character vector of document IDs the same length as \code{text}. If \code{docId == "create"}, document IDs will be created.
-#' @param removePunc \code{TRUE} or \code{FALSE} indicating whether or not to remove punctuation from \code{text} while tokenizing. If \code{TRUE}, puncuation will be removed. Defaults to \code{TRUE}.
+#' @param removePunc \code{TRUE} or \code{FALSE} indicating whether or not to remove punctuation from \code{text} while tokenizing. If \code{TRUE}, punctuation will be removed. Defaults to \code{TRUE}.
#' @param removeNum \code{TRUE} or \code{FALSE} indicating whether or not to remove numbers from \code{text} while tokenizing. If \code{TRUE}, numbers will be removed. Defaults to \code{TRUE}.
#' @param toLower \code{TRUE} or \code{FALSE} indicating whether or not to coerce all of \code{text} to lowercase while tokenizing. If \code{TRUE}, \code{text} will be coerced to lowercase. Defaults to \code{TRUE}.
#' @param stemWords \code{TRUE} or \code{FALSE} indicating whether or not to stem resulting tokens. If \code{TRUE}, the resulting tokens will be stemmed using \code{SnowballC::wordStem()}. Defaults to \code{TRUE}.
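A sketch of the parse step using only the documented arguments (the shape of the return value is not shown in this diff, so it is only described in comments):

```r
library(lexRankr)

docs <- c("The system crashed at noon. A restart fixed it.",
          "Noon crashes were traced to a memory leak.")

parsed <- sentenceTokenParse(text = docs,
                             docId = "create",
                             removePunc = TRUE,
                             removeNum  = TRUE,
                             toLower    = TRUE,
                             stemWords  = TRUE)
# parsed carries document/sentence IDs alongside sentences and tokens,
# for use in the other lexRank functions (structure assumed, not shown here)
```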
2 changes: 1 addition & 1 deletion R/tokenize.R
@@ -3,7 +3,7 @@ utils::globalVariables(c("smart_stopwords"))

#' Parse the elements of a character vector into a list of cleaned tokens.
#' @param text The character vector to be tokenized
-#' @param removePunc \code{TRUE} or \code{FALSE} indicating whether or not to remove punctuation from \code{text}. If \code{TRUE}, puncuation will be removed. Defaults to \code{TRUE}.
+#' @param removePunc \code{TRUE} or \code{FALSE} indicating whether or not to remove punctuation from \code{text}. If \code{TRUE}, punctuation will be removed. Defaults to \code{TRUE}.
#' @param removeNum \code{TRUE} or \code{FALSE} indicating whether or not to remove numbers from \code{text}. If \code{TRUE}, numbers will be removed. Defaults to \code{TRUE}.
#' @param toLower \code{TRUE} or \code{FALSE} indicating whether or not to coerce all of \code{text} to lowercase. If \code{TRUE}, \code{text} will be coerced to lowercase. Defaults to \code{TRUE}.
#' @param stemWords \code{TRUE} or \code{FALSE} indicating whether or not to stem resulting tokens. If \code{TRUE}, the resulting tokens will be stemmed using \code{SnowballC::wordStem()}. Defaults to \code{TRUE}.
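Per the title above, the result is a list of cleaned token vectors, one per input element; a small sketch of the cleaning switches:

```r
library(lexRankr)

tokenize(c("The 3 systems crashed!", "Restarting fixed them."),
         removePunc = TRUE,   # drops "!"
         removeNum  = TRUE,   # drops "3"
         toLower    = TRUE,
         stemWords  = TRUE)   # e.g. "systems" -> "system" via SnowballC
```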
4 changes: 2 additions & 2 deletions README.md
@@ -13,11 +13,11 @@ devtools::install_github("AdamSpannbauer/lexRankr")
```

## Overview
-lexRankr is an R implementation of the LexRank algorithm discussed by Güneş Erkan & Dragomir R. Radev in [LexRank: Graph-based Lexical Centrality as Salience in Text Summarization](http://www.cs.cmu.edu/afs/cs/project/jair/pub/volume22/erkan04a-html/erkan04a.html). LexRank is designed to summarize a cluster of documents by proposing which sentences subsume the most information in that particular set of documents. The algorithm may not perform well on a set of unclustered/unrelated set of documents. As the white paper's title suggests, the sentences are ranked based on their centrality in a graph. The graph is built upon the pairwise similarities of the sentences (where similarity is measured with a modified idf cosine similiarity function). The paper describes multiple ways to calculate centrality and these options are available in the R package. The sentences can be ranked according to their degree of centrality or by using the Page Rank algorithm (both of these methods require setting a minimum similarity threshold for a sentence pair to be included in the graph). A third variation is Continuous LexRank which does not require a minimum similarity threshold, but rather uses a weighted graph of sentences as the input to Page Rank.
+lexRankr is an R implementation of the LexRank algorithm discussed by Güneş Erkan & Dragomir R. Radev in [LexRank: Graph-based Lexical Centrality as Salience in Text Summarization](http://www.cs.cmu.edu/afs/cs/project/jair/pub/volume22/erkan04a-html/erkan04a.html). LexRank is designed to summarize a cluster of documents by proposing which sentences subsume the most information in that particular set of documents. The algorithm may not perform well on a set of unclustered/unrelated set of documents. As the white paper's title suggests, the sentences are ranked based on their centrality in a graph. The graph is built upon the pairwise similarities of the sentences (where similarity is measured with a modified idf cosine similarity function). The paper describes multiple ways to calculate centrality and these options are available in the R package. The sentences can be ranked according to their degree of centrality or by using the Page Rank algorithm (both of these methods require setting a minimum similarity threshold for a sentence pair to be included in the graph). A third variation is Continuous LexRank which does not require a minimum similarity threshold, but rather uses a weighted graph of sentences as the input to Page Rank.

*note: the lexrank algorithm is designed to work on a cluster of documents. LexRank is built on the idea that a cluster of docs will focus on similar topics*

-*note: pairwise sentence similiarity is calculated for the entire set of documents passed to the function. This can be a computationally instensive process (esp with a large set of documents)*
+*note: pairwise sentence similarity is calculated for the entire set of documents passed to the function. This can be a computationally instensive process (esp with a large set of documents)*

## Basic Usage
```r
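# (The original usage example is truncated in this diff view; the call below
#  is a minimal sketch built from the documented lexRank() arguments, not the
#  README's verbatim example.)
library(lexRankr)

docs <- c("Testing the system. Second sentence for you.",
          "System testing the tidy documents.",
          "Documents will be parsed and lexranked.")

# top-ranked sentences across the document cluster, doc IDs auto-created
lexRank(docs, docId = "create", n = 3)
```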
4 changes: 2 additions & 2 deletions man/lexRank.Rd
8 changes: 4 additions & 4 deletions man/lexRankFromSimil.Rd
6 changes: 3 additions & 3 deletions man/sentenceSimil.Rd
2 changes: 1 addition & 1 deletion man/sentenceTokenParse.Rd

(Generated .Rd files; diffs are not rendered by default.)

