Solr harvester (#43)

* rename spotlight_oaipmh_harvesters to spotlight_harvesters
* add :type column to harvesters table
  The :type column will allow us to use Single Table Inheritance for the OAIPMH and Solr harvesters. The migration also includes a data migration to add a value for :type for all existing harvesters.
* create Harvester and SolrHarvester models
  Sets up Harvester as the "base" class for the two harvest types. Migrates some common logic from OaipmhHarvester to Harvester in line with this.
* repurpose OaipmhHarvesterController to handle Solr harvests as well, add HarvestType
  These changes are strongly inspired by changes found in the job_entry branch, originally authored by @ives1227
* update routes to use new "base" harvester model
* port SolrHarvester logic from job_entry branch (e905c7c)
* rework HarvesterController for clarity
* add db column for solr harvest mapping filenames
* add solr option to harvester form
  Builds UI that allows creation of SolrHarvesters. Combines changes from the deleted form partial with the one found in e905c7c
* locales from e905c7c
  Now that solr harvests are an option, this generalizes language on the harvester form to not just refer to MODS harvests
* extract common logic into base harvester model
* bring over default solr mapping file from job_entry branch
* bring over unmodified logic relevant to solr harvests from the job_entry branch
* rename SolrHarvestingItem to SolrHarvestingParser to better reflect its purpose
* move logic from SolrHarvestingBuilder into SolrHarvester
  In Spotlight v3.3.0, the builder pattern is no longer used. This first step is extracting the existing logic (untouched so far) so we can delete the builder.
* create SolrUpload model
  This will be the SQL copy of items harvested by the SolrHarvester
* generalize #get_mapping_file method, remove most references to "resource" from SolrHarvester
* persist harvested solr data in SolrUploads
* rename #get_harvests to #solr_harvests
* simplify getting data from solr
* properly connect to solr url
* align harvester method names
* add job progress total logic to solr harvester
* finalize updating job tracker logic
* remove unused cursor/schedule logic
  Before, every "row" of solr data was harvested in a separate job. Due to how JobTrackers work in Spotlight v3.3.0 (one tracker per job), we run the whole solr harvest in a single PerformHarvestJob
* rearrange methods in SolrHarvester
* extract single item harvesting logic to SolrHarvester#harvest_item
* generalize common error handling logic
* add @sidecar_id tracking to SolrHarvester
* fix namespace errors, simplify unique_id_field logic, remove unused #get_unique_id_field_name method
* override SolrUpload#compound_id
  This allows us to connect URNs with the correct SolrDocuments
* custom metadata gets indexed on initial solr harvest
* fix updating existing solr harvest items' metadata
* upgraded everything let's see how it goes (#42)
* account for missing trailing "/" in base_url
* fix undefined error_msg bug
* Update jbsd.yml (#45)
* strip whitespace from URNs when harvesting (#44)
* Update email template (#46)
* Update email per VV
* Update en.yml
* Fixing bug with set name in title. Fixing typo.
* Change the subject line
  Co-authored-by: Maura C <[email protected]>
* Updating version to 3.0.0-beta.12. (#47)
* implement cursor search for solr harvesters
  Replaces paginated search, which was failing due to the Solr server's restrictions
* Update solr_harvester.rb
* Update solr_harvester.rb
* add config file to declare the unique key for each Solr set
  This is meant to be a temporary solution due to data structure inconsistencies between the Solr sets; specifically, the fact that they currently use different fields as the unique key
* Bumping version to 3.0.0-beta.13.

Co-authored-by: dl-maura <[email protected]>
Co-authored-by: Phil Plencner <[email protected]>
Co-authored-by: Phil Plencner <[email protected]>
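
The "implement cursor search for solr harvesters" change is not shown in the hunks below, so the following is only a minimal sketch of how Solr cursor paging generally works, not the code from this commit. It assumes the standard Solr behaviour: the first request sends the cursor mark `*`, each response carries a `nextCursorMark`, and the walk ends when the returned mark equals the one that was sent. The names `each_solr_doc`, `FAKE_SOLR` and `DOCS` are hypothetical, and the lambda stands in for a real HTTP request; its flat hash only loosely mirrors a Solr response.

```ruby
# Hypothetical sketch of Solr cursor-based paging; not the commit's code.
# `fetch_page` stands in for an HTTP request to a Solr select handler.

# Yields every document in a result set by following cursor marks.
# fetch_page: callable taking a cursor mark String and returning a Hash
#             with 'docs' (Array) and 'nextCursorMark' (String).
def each_solr_doc(fetch_page)
  return enum_for(:each_solr_doc, fetch_page) unless block_given?

  cursor = '*' # starting cursor mark defined by Solr
  loop do
    page = fetch_page.call(cursor)
    page['docs'].each { |doc| yield doc }
    next_cursor = page['nextCursorMark']
    # Solr signals the end by returning the same mark that was sent.
    break if next_cursor == cursor

    cursor = next_cursor
  end
end

# In-memory stand-in for a Solr server: five documents, two per page.
# The numeric cursor marks are an assumption made for this fake only.
DOCS = %w[urn-1 urn-2 urn-3 urn-4 urn-5].map { |id| { 'id' => id } }.freeze
FAKE_SOLR = lambda do |cursor|
  start = cursor == '*' ? 0 : cursor.to_i
  docs = DOCS[start, 2] || []
  next_mark = docs.empty? ? cursor : (start + docs.size).to_s
  { 'docs' => docs, 'nextCursorMark' => next_mark }
end

HARVESTED_IDS = each_solr_doc(FAKE_SOLR).map { |doc| doc['id'] }
```

Against a real server, cursor paging also requires the query to sort on the collection's unique key, which may be why the commit adds a per-set config file declaring that key.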
harvard-lts · Oct 31, 2022 · af405a4
1 parent c01ca6d
commit af405a4
Showing 22 changed files with 695 additions and 133 deletions.
diff --git a/Gemfile.lock b/Gemfile.lock
@@ -48,7 +48,7 @@ GIT
 PATH
   remote: .
   specs:
-    spotlight-oaipmh-resources (3.0.0.pre.beta.12)
+    spotlight-oaipmh-resources (3.0.0.pre.beta.13)
       mods
       oai
 

diff --git a/app/controllers/spotlight/resources/harvester_controller.rb b/app/controllers/spotlight/resources/harvester_controller.rb
@@ -0,0 +1,70 @@
+
+module Spotlight::Resources
+  class HarvesterController < Spotlight::ApplicationController
+
+    load_and_authorize_resource :exhibit, class: Spotlight::Exhibit
+
+    # POST /harvester
+    def create
+      upload if resource_params.has_key?(:custom_mapping)
+
+      harvester = build_harvester_by_type(resource_params[:type])
+      if harvester.save
+        Spotlight::Resources::PerformHarvestsJob.perform_later(harvester: harvester, user: current_user)
+        flash[:notice] = t('spotlight.resources.harvester.performharvest.success', set: resource_params[:set])
+      else
+        flash[:error] = "Failed to create harvester for #{resource_params[:set]}. #{harvester.errors.full_messages.to_sentence}"
+      end
+      redirect_to spotlight.admin_exhibit_catalog_path(current_exhibit, sort: :timestamp)
+    end
+
+    private
+
+    def upload
+      name = resource_params[:custom_mapping].original_filename
+      Dir.mkdir("public/uploads") unless Dir.exist?("public/uploads")
+      dir = "public/uploads/modsmapping"
+      if (resource_params[:type]  == Spotlight::HarvestType::SOLR)
+        dir = "public/uploads/solrmapping"
+      end
+      Dir.mkdir(dir) unless Dir.exist?(dir)
+
+      path = File.join(dir, name)
+      File.open(path, "w") { |f| f.write(resource_params[:custom_mapping].read) }
+    end
+
+    def build_harvester_by_type(type)
+      if type == Spotlight::HarvestType::MODS
+        Spotlight::OaipmhHarvester.new(
+          base_url: resource_params[:url],
+          set: resource_params[:set],
+          mods_mapping_file: mapping_file(type),
+          exhibit: current_exhibit
+        )
+      else
+        Spotlight::SolrHarvester.new(
+          base_url: resource_params[:url],
+          set: resource_params[:set],
+          solr_mapping_file: mapping_file(type),
+          exhibit: current_exhibit
+        )
+      end
+    end
+
+    def mapping_file(type)
+      return resource_params[:custom_mapping].original_filename if resource_params[:custom_mapping].present?
+
+      mapping_file = if type == Spotlight::HarvestType::MODS
+                       resource_params[:mods_mapping_file]
+                     else
+                       resource_params[:solr_mapping_file]
+                     end
+
+      mapping_file
+    end
+
+    def resource_params
+      params.require(:harvester).permit(:type, :url, :set, :mods_mapping_file, :solr_mapping_file, :custom_mapping)
+    end
+  end
+end
diff --git a/app/controllers/spotlight/resources/oaipmh_harvester_controller.rb b/app/controllers/spotlight/resources/oaipmh_harvester_controller.rb
diff --git a/app/jobs/spotlight/resources/perform_harvests_job.rb b/app/jobs/spotlight/resources/perform_harvests_job.rb
@@ -26,7 +26,7 @@ def perform(harvester:, user: nil)
       @exhibit = harvester.exhibit
       @set = harvester.set
       @user = user
-      @sidecar_ids = harvester.harvest_oai_items(job_tracker: job_tracker, job_progress: progress)
+      @sidecar_ids = harvester.harvest_items(job_tracker: job_tracker, job_progress: progress)
       @total_errors = harvester.total_errors
       @total_warnings = 0
 

diff --git a/app/models/spotlight/harvest_type.rb b/app/models/spotlight/harvest_type.rb
@@ -0,0 +1,7 @@
+module Spotlight
+  class HarvestType
+    MODS = "MODS"
+    SOLR = "Solr"
+    HARVEST_TYPES = [MODS, SOLR]
+  end
+end
diff --git a/app/models/spotlight/harvester.rb b/app/models/spotlight/harvester.rb
@@ -0,0 +1,46 @@
+module Spotlight
+  class Harvester < ActiveRecord::Base
+    belongs_to :exhibit
+
+    attr_accessor :total_errors
+
+    validates :base_url, presence: true
+    validates :set, presence: true
+
+    def self.mapping_files(dir_name)
+      if (Dir.exist?("public/uploads/#{dir_name}"))
+        files = Dir.entries("public/uploads/#{dir_name}")
+        files.delete('.')
+        files.delete('..')
+      else
+        files = Array.new
+      end
+
+      files.insert(0, 'New Mapping File')
+      files.insert(0, 'Default Mapping File')
+      files
+    end
+
+    def handle_item_harvest_error(error, parsed_item, job_tracker = nil)
+      error_msg = parsed_item.id + ' did not index successfully:'
+      Delayed::Worker.logger.add(Logger::ERROR, error_msg)
+      Delayed::Worker.logger.add(Logger::ERROR, error.message)
+      Delayed::Worker.logger.add(Logger::ERROR, error.backtrace)
+      if job_tracker.present?
+        job_tracker.append_log_entry(type: :error, exhibit: exhibit, message: error_msg)
+        job_tracker.append_log_entry(type: :error, exhibit: exhibit, message: error.message)
+      end
+      self.total_errors += 1
+    end
+
+    def update_progress_total(job_progress)
+      job_progress.total = complete_list_size
+    end
+
+    def get_mapping_file
+      return if mapping_file.eql?('Default Mapping File') || mapping_file.eql?('New Mapping File')
+
+      mapping_file
+    end
+  end
+end
diff --git a/app/models/spotlight/oaipmh_harvester.rb b/app/models/spotlight/oaipmh_harvester.rb
@@ -3,29 +3,14 @@
 require 'uri'
 
 module Spotlight
-  class OaipmhHarvester < ActiveRecord::Base
-    belongs_to :exhibit
-
-    validates :base_url, presence: true
-    validates :set, presence: true
-
-    attr_accessor :total_errors
+  class OaipmhHarvester < Harvester
+    alias_attribute :mapping_file, :mods_mapping_file
 
     def self.mapping_files
-      if (Dir.exist?('public/uploads/modsmapping'))
-        files = Dir.entries('public/uploads/modsmapping')
-        files.delete('.')
-        files.delete('..')
-      else
-        files = Array.new
-      end
-
-      files.insert(0, 'New Mapping File')
-      files.insert(0, 'Default Mapping File')
-      files
+      super('modsmapping')
     end
 
-    def harvest_oai_items(job_tracker: nil, job_progress: nil)
+    def harvest_items(job_tracker: nil, job_progress: nil)
       self.total_errors = 0
       @sidecar_ids = []
       harvests = oaipmh_harvests
@@ -94,15 +79,7 @@ def harvest_item(record, job_tracker, job_progress)
 
       job_progress&.increment
     rescue Exception => e
-      error_msg = parsed_oai_item.id + ' did not index successfully:'
-      Delayed::Worker.logger.add(Logger::ERROR, error_msg)
-      Delayed::Worker.logger.add(Logger::ERROR, e.message)
-      Delayed::Worker.logger.add(Logger::ERROR, e.backtrace)
-      if job_tracker.present?
-        job_tracker.append_log_entry(type: :error, exhibit: exhibit, message: error_msg)
-        job_tracker.append_log_entry(type: :error, exhibit: exhibit, message: e.message)
-      end
-      self.total_errors += 1
+      handle_item_harvest_error(e, parsed_oai_item, job_tracker)
     end
 
     def oaipmh_harvests
@@ -131,15 +108,5 @@ def client
     def oai_mods_converter
       @oai_mods_converter ||= Spotlight::Resources::OaipmhModsConverter.new(set, exhibit.slug, get_mapping_file)
     end
-
-    def update_progress_total(job_progress)
-      job_progress.total = complete_list_size
-    end
-
-    def get_mapping_file
-      return if mapping_file.eql?('Default Mapping File') || mapping_file.eql?('New Mapping File')
-
-      mapping_file
-    end
   end
 end