Custom Scraper

GeoffreyChen777 edited this page Sep 19, 2022 · 11 revisions

This page explains how to set up a custom scraper.

My research topic is computer vision, which is only one piece of the computer science puzzle. If the built-in metadata scrapers do not suit your research field, you can write your own metadata scraper.

Design

A metadata scraper consists of three main functions: preProcess, parsingProcess, and scrapeImpl. preProcess usually returns three values: scrapeURL, headers, and enable. parsingProcess parses the response from the database API URL scrapeURL and assigns the metadata to a paper entity draft, paperEntityDraft. scrapeImpl first calls preProcess, then sends the network request, and finally calls parsingProcess. The paperEntityDraft passes through all enabled scrapers in turn and is finally inserted into, or updated in, the Paperlib database.
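
The flow described above can be sketched as a small standalone class. This is a minimal sketch, not Paperlib's actual implementation: SketchScraper, fakeNetworkTool, and the example.org URL are hypothetical stand-ins so the snippet runs outside the app, and the draft is a plain object rather than a real paper entity.

```javascript
// Hypothetical stand-in for window.networkTool so the sketch runs outside Paperlib.
const fakeNetworkTool = {
  async get(url, headers) {
    // Pretend the API answered with a JSON body.
    return { body: JSON.stringify({ title: "A Sample Title" }) };
  },
};

class SketchScraper {
  // Decide whether to run, and build the request.
  preProcess(paperEntityDraft) {
    const enable = paperEntityDraft.doi !== "";
    // Hypothetical URL, used only for illustration.
    const scrapeURL = `https://example.org/api/${paperEntityDraft.doi}`;
    const headers = { Accept: "application/json" };
    return { scrapeURL, headers, enable };
  }

  // Copy fields from the response into the draft (a plain object here).
  parsingProcess(rawResponse, paperEntityDraft) {
    const response = JSON.parse(rawResponse.body);
    paperEntityDraft.title = response.title;
    return paperEntityDraft;
  }

  // preProcess -> network request -> parsingProcess
  async scrapeImpl(paperEntityDraft, force = false) {
    const { scrapeURL, headers, enable } = this.preProcess(paperEntityDraft);
    if (enable || force) {
      const response = await fakeNetworkTool.get(scrapeURL, headers);
      return this.parsingProcess(response, paperEntityDraft);
    }
    // Disabled scrapers hand the draft on unchanged.
    return paperEntityDraft;
  }
}
```

Because a disabled scraper returns the draft unchanged, several scrapers can be chained and each one only fills in what it is able to find.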

Add a custom scraper

Open the preference window, click the scrapers tab, then click the + button.

scrapeImpl

The default scrapeImpl function is:

  const { scrapeURL, headers, enable } = this.preProcess(
    paperEntityDraft
  );

  if (enable || force) {
    const response = (await window.networkTool.get(
      scrapeURL,
      headers
    ));
    return this.parsingProcess(response, paperEntityDraft);
  } else {
    return paperEntityDraft;
  }

Usually, it is unnecessary to modify this function.

preProcess

Let's use the built-in DOI scraper as an example.

    const enable =
      paperEntityDraft.doi !== "" &&
      this.getEnable("doi") &&
      this.isPreprint(paperEntityDraft);
    const doiID = formatString({
      str: paperEntityDraft.doi,
      removeNewline: true,
      removeWhite: true,
    });
    const scrapeURL = `https://dx.doi.org/${doiID}`;
    const headers = {
      Accept: "application/json",
    };

    if (enable) {
      this.stateStore.logState.processLog = `Scraping metadata by DOI ...`;
    }

    return { scrapeURL, headers, enable };

This function first determines whether the scraper should be enabled. Here, enable is true only if the paperEntityDraft has a non-empty doi property, you have enabled this scraper in the preference window, and the paper is a preprint.

After that, we construct the scrapeURL.

Some APIs require specific HTTP headers, so we set them here.

Finally, we send a message to Paperlib saying that the scraper is about to scrape the metadata of this paper.
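
The same steps apply when an API is queried by title instead of DOI. The following is a minimal sketch under stated assumptions: buildTitleRequest is a hypothetical helper, the api.example.org endpoint does not exist, and the scraperEnabled argument stands in for a call such as this.getEnable("my-scraper").

```javascript
// Hypothetical helper: builds the three preProcess return values for a
// title-based lookup. `scraperEnabled` stands in for this.getEnable(...).
function buildTitleRequest(paperEntityDraft, scraperEnabled) {
  const title = paperEntityDraft.title.trim();
  // Only run when there is a title to search for and the scraper is switched on.
  const enable = title !== "" && scraperEnabled;
  // encodeURIComponent keeps spaces and punctuation from breaking the query string.
  const scrapeURL =
    "https://api.example.org/search?title=" + encodeURIComponent(title);
  const headers = { Accept: "application/json" };
  return { scrapeURL, headers, enable };
}
```

Encoding the title matters because titles often contain spaces, ampersands, or question marks that would otherwise change the meaning of the URL.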

parsingProcess

    const response = JSON.parse(rawResponse.body)
    const title = response.title;
    const authors = response.author
      .map((author) => {
        return author.given.trim() + " " + author.family.trim();
      })
      .join(", ");
    const pubTime = response.published["date-parts"][0][0];
    let pubType;
    if (response.type == "proceedings-article") {
      pubType = 1;
    } else if (response.type == "journal-article") {
      pubType = 0;
    } else {
      pubType = 2;
    }
    const publication = response["container-title"];

    paperEntityDraft.setValue("title", title);
    paperEntityDraft.setValue("authors", authors);
    paperEntityDraft.setValue("pubTime", `${pubTime}`);
    paperEntityDraft.setValue("pubType", pubType);
    paperEntityDraft.setValue("publication", publication);
    if (response.volume) {
      paperEntityDraft.setValue("volume", response.volume);
    }
    if (response.issue) {
      paperEntityDraft.setValue("number", response.issue);
    }
    if (response.page) {
      paperEntityDraft.setValue("pages", response.page);
    }
    if (response.publisher) {
      paperEntityDraft.setValue(
        "publisher",
        response.publisher ===
          "Institute of Electrical and Electronics Engineers (IEEE)"
          ? "IEEE"
          : response.publisher
      );
    }
    return paperEntityDraft;

The parsingProcess function is easy to understand: it parses the rawResponse and assigns the corresponding values to the paperEntityDraft.

You can call console.log(rawResponse) and console.log(paperEntityDraft) in this function to inspect the structure of these two input variables. The log appears in the developer tools window (option+cmd+I).
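
When adapting this code to another API, note that author.given.trim() throws a TypeError if an author entry has no given name. The following is a minimal sketch of a more defensive version, assuming some entries may lack given or family; formatAuthors and mapPubType are hypothetical helper names, and the pubType numbers are the ones used in the code above.

```javascript
// Hypothetical helper: joins author names, skipping missing or empty parts.
function formatAuthors(authorList) {
  return (authorList ?? [])
    .map((author) =>
      [author.given, author.family]
        .filter((part) => typeof part === "string" && part.trim() !== "")
        .map((part) => part.trim())
        .join(" ")
    )
    // Drop entries that had neither a given nor a family name.
    .filter((name) => name !== "")
    .join(", ");
}

// Hypothetical helper: same mapping as the DOI scraper above
// (0 = journal article, 1 = proceedings article, 2 = anything else).
function mapPubType(type) {
  if (type === "proceedings-article") return 1;
  if (type === "journal-article") return 0;
  return 2;
}
```

Inside parsingProcess these could be used as paperEntityDraft.setValue("authors", formatAuthors(response.author)).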

Args

You may need some configurable args in your scraper. For example, some database APIs, such as IEEE Xplore, may require an API key. You can access the args in your configuration as follows:

const ieeeAPIKey = this.preference.get("scrapers")["ieee"]?.args ?? "";
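
The optional chaining (?.) and nullish coalescing (??) in that line give an empty string when the scraper entry or its args are missing. The following is a minimal sketch of that behavior outside the app: fakePreference and readArgs are hypothetical, and the shape of the preference store is an assumption inferred from the line above.

```javascript
// Hypothetical stand-in for this.preference inside a scraper.
const fakePreference = {
  store: { scrapers: { ieee: { args: "my-api-key" } } },
  get(key) {
    return this.store[key];
  },
};

// Hypothetical helper: returns the args string of a scraper, or "" when the
// scraper entry (or its args) does not exist.
function readArgs(preference, scraperName) {
  return preference.get("scrapers")[scraperName]?.args ?? "";
}
```

With this fallback, preProcess can disable the scraper when the key is empty instead of sending a request that is bound to fail.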

Use a custom scraper to implement an auto tagger

This section shows how to use the custom scraper feature to implement an auto tagger that automatically tags your newly imported papers.

Name

auto-tagger

Args

{"semi-supervised":"semi-supervised", "segmentation":"segmentation", "detection":"detection"}

You can define your own rules in the form {"keyword":"tag name"}.

preProcess

enable = paperEntityDraft.tags.length === 0 && this.getEnable("auto-tagger");

parsingProcess

const tagMap = JSON.parse(this.preference.get("scrapers")["auto-tagger"].args);

const title = paperEntityDraft.title.toLowerCase();

// Collect a tag for every keyword that appears in the title.
const autoTags = [];
for (const [key, tag] of Object.entries(tagMap)) {
  if (title.includes(key.toLowerCase())) {
    autoTags.push(new PaperTag(tag, 1));
  }
}

paperEntityDraft.setValue("tags", autoTags);
return paperEntityDraft;
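
To try out a rule set before pasting it into the Args field, the matching logic can be run on its own. The following is a minimal sketch: matchTags is a hypothetical helper that returns plain tag names instead of PaperTag objects, and the duplicate check is an addition that is not part of the code above.

```javascript
// Hypothetical helper: returns the tag names whose keyword occurs in the title.
// `argsJSON` is the same JSON string you would put in the Args field.
function matchTags(title, argsJSON) {
  const tagMap = JSON.parse(argsJSON);
  const lowerTitle = title.toLowerCase();
  const tags = [];
  for (const [keyword, tag] of Object.entries(tagMap)) {
    // Case-insensitive substring match; skip tags that were already added
    // (this duplicate check is an addition, not in the original snippet).
    if (lowerTitle.includes(keyword.toLowerCase()) && !tags.includes(tag)) {
      tags.push(tag);
    }
  }
  return tags;
}

const exampleArgs =
  '{"semi-supervised":"semi-supervised", "segmentation":"segmentation", "detection":"detection"}';
const exampleTags = matchTags("Semi-Supervised Object Detection", exampleArgs);
```

Matching is by substring, so a short keyword such as "net" would also match "network"; choose keywords that are specific enough for your field.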

scrapeImpl

const { scrapeURL, headers, enable } = this.preProcess(
  paperEntityDraft
);

if (enable) {
  return this.parsingProcess('', paperEntityDraft);
} else {
  return paperEntityDraft;
}
