Solr harvester (#43)
* rename spotlight_oaipmh_harvesters to spotlight_harvesters

* add :type column to harvesters table

The :type column will allow us to use Single Table Inheritance for the OAIPMH and Solr harvesters.

The migration also includes a data migration to add a value for :type for all existing harvesters.

* create Harvester and SolrHarvester models

Sets up Harvester as the "base" class for the two harvest types. Migrates some common logic from OaipmhHarvester to Harvester in line with this.

* repurpose OaipmhHarvesterController to handle Solr harvests as well, add HarvestType

These changes are strongly inspired by changes found in the job_entry branch, originally authored by @ives1227

* update routes to use new "base" harvester model

* port SolrHarvester logic from job_entry branch (e905c7c)

* rework HarvesterController for clarity

* add db column for solr harvest mapping filenames

* add solr option to harvester form

Builds UI that allows creation of SolrHarvesters. Combines changes from the deleted form partial with the one found in e905c7c

* locales from e905c7c

Now that solr harvests are an option, this generalizes language on the harvester form to not just refer to MODS harvests

* extract common logic into base harvester model

* bring over default solr mapping file from job_entry branch

* bring over unmodified logic relevant to solr harvests from the job_entry branch

* rename SolrHarvestingItem to SolrHarvestingParser to better reflect its purpose

* move logic from SolrHarvestingBuilder into SolrHarvester

In Spotlight v3.3.0, the builder pattern is no longer used. This first step is extracting the existing logic (untouched so far) so we can delete the builder.

* create SolrUpload model

This will be the SQL copy of items harvested by the SolrHarvester

* generalize #get_mapping_file method, remove most references to "resource" from SolrHarvester

* persist harvested solr data in SolrUploads

* rename #get_harvests to #solr_harvests

* simplify getting data from solr

* properly connect to solr url

* align harvester method names

* add job progress total logic to solr harvester

* finalize updating job tracker logic

* remove unused cursor/schedule logic

Before, every "row" of solr data was harvested in a separate job. Due to how JobTrackers work in Spotlight v3.3.0 (one tracker per job), we run the whole solr harvest in a single PerformHarvestJob

* rearrange methods in SolrHarvester

* extract single item harvesting logic to SolrHarvester#harvest_item

* generalize common error handling logic

* add @sidecar_id tracking to SolrHarvester

* fix namespace errors, simplify unique_id_field logic, remove unused #get_unique_id_field_name method

* override SolrUpload#compound_id

This allows us to connect URNs with the correct SolrDocuments

* custom metadata gets indexed on initial solr harvest

* fix updating existing solr harvest items' metadata

* upgraded everything let's see how it goes (#42)

* account for missing trailing "/" in base_url

* fix undefined error_msg bug

* Update jbsd.yml (#45)

* strip whitespaces from URNs when harvesting (#44)

* Update email template (#46)

* Update email per VV

* Update en.yml

* Fixing bug with set name in title. Fixing typo.

* Change the subject line

Co-authored-by: Maura C <[email protected]>

* Updating version to 3.0.0-beta.12. (#47)

* implement cursor search for solr harvesters

Replaces paginated search, which was failing due to the Solr server's restrictions

* Update solr_harvester.rb

* Update solr_harvester.rb

* add config file to declare the unique key for each Solr set

This is meant to be a temporary solution due to data structure inconsistencies between the Solr sets; specifically, the fact that they currently use different fields as the unique key

* Bumping version to 3.0.0-beta.13.

Co-authored-by: dl-maura <[email protected]>
Co-authored-by: Phil Plencner <[email protected]>
Co-authored-by: Phil Plencner <[email protected]>
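
For the "implement cursor search for solr harvesters" item above, here is a minimal sketch of how Solr cursor paging generally works. It is not the code from `solr_harvester.rb`. Solr documents `cursorMark=*` as the starting mark, requires the sort to include the uniqueKey field, and signals the end of the result set by returning a `nextCursorMark` equal to the mark that was sent. The helper name `each_solr_doc`, the `fetch_page` callable, and the canned `PAGES` data are all assumptions made so the example runs without a Solr server.

```ruby
# Hypothetical helper (not from the commit): walks a Solr result set using
# cursorMark paging. `fetch_page` is an assumed callable that takes a cursor
# mark and returns a Hash shaped like a Solr JSON response.
def each_solr_doc(fetch_page)
  cursor = '*' # Solr's documented initial cursor mark
  loop do
    response = fetch_page.call(cursor)
    response.fetch('response').fetch('docs').each { |doc| yield doc }

    next_cursor = response['nextCursorMark']
    # Solr returns the same mark it was given once there is nothing left.
    break if next_cursor.nil? || next_cursor == cursor

    cursor = next_cursor
  end
end

# Canned pages standing in for HTTP requests (illustrative data only).
PAGES = {
  '*'  => { 'response' => { 'docs' => [{ 'id' => 'a' }, { 'id' => 'b' }] }, 'nextCursorMark' => 'm1' },
  'm1' => { 'response' => { 'docs' => [{ 'id' => 'c' }] }, 'nextCursorMark' => 'm2' },
  'm2' => { 'response' => { 'docs' => [] }, 'nextCursorMark' => 'm2' }
}.freeze

fake_fetch = ->(cursor) { PAGES.fetch(cursor) }

ids = []
each_solr_doc(fake_fetch) { |doc| ids << doc['id'] }
```

In a real harvest, `fetch_page` would issue the HTTP request with the `cursorMark` parameter and a sort that includes the set's unique key, which may be why the commit also adds a per-set unique key config.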
4 people authored Oct 31, 2022
1 parent c01ca6d commit af405a4
Showing 22 changed files with 695 additions and 133 deletions.
2 changes: 1 addition & 1 deletion Gemfile.lock
@@ -48,7 +48,7 @@ GIT
PATH
remote: .
specs:
-spotlight-oaipmh-resources (3.0.0.pre.beta.12)
+spotlight-oaipmh-resources (3.0.0.pre.beta.13)
mods
oai

70 changes: 70 additions & 0 deletions app/controllers/spotlight/resources/harvester_controller.rb
@@ -0,0 +1,70 @@

module Spotlight::Resources
class HarvesterController < Spotlight::ApplicationController

load_and_authorize_resource :exhibit, class: Spotlight::Exhibit

# POST /harvester
def create
upload if resource_params.has_key?(:custom_mapping)

harvester = build_harvester_by_type(resource_params[:type])
if harvester.save
Spotlight::Resources::PerformHarvestsJob.perform_later(harvester: harvester, user: current_user)
flash[:notice] = t('spotlight.resources.harvester.performharvest.success', set: resource_params[:set])
else
flash[:error] = "Failed to create harvester for #{resource_params[:set]}. #{harvester.errors.full_messages.to_sentence}"
end
redirect_to spotlight.admin_exhibit_catalog_path(current_exhibit, sort: :timestamp)
end

private

def upload
name = resource_params[:custom_mapping].original_filename
Dir.mkdir("public/uploads") unless Dir.exist?("public/uploads")
dir = "public/uploads/modsmapping"
if (resource_params[:type] == Spotlight::HarvestType::SOLR)
dir = "public/uploads/solrmapping"
end
Dir.mkdir(dir) unless Dir.exist?(dir)

path = File.join(dir, name)
File.open(path, "w") { |f| f.write(resource_params[:custom_mapping].read) }
end

def build_harvester_by_type(type)
if type == Spotlight::HarvestType::MODS
Spotlight::OaipmhHarvester.new(
base_url: resource_params[:url],
set: resource_params[:set],
mods_mapping_file: mapping_file(type),
exhibit: current_exhibit
)
else
Spotlight::SolrHarvester.new(
base_url: resource_params[:url],
set: resource_params[:set],
solr_mapping_file: mapping_file(type),
exhibit: current_exhibit
)
end
end

def mapping_file(type)
return resource_params[:custom_mapping].original_filename if resource_params[:custom_mapping].present?

mapping_file = if type == Spotlight::HarvestType::MODS
resource_params[:mods_mapping_file]
else
resource_params[:solr_mapping_file]
end

mapping_file
end

def resource_params
params.require(:harvester).permit(:type, :url, :set, :mods_mapping_file, :solr_mapping_file, :custom_mapping)
end
end
end
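
A note on the `upload` step above: it builds the target path from `original_filename` and creates the directories in two separate `Dir.mkdir` calls. The following is a minimal stdlib sketch of the same step, not the controller's code. The helper name `store_mapping_file`, its parameters, and the sample file content are hypothetical. Whether the filename arrives already stripped of directory parts depends on the upstream stack, so the sketch takes the base name defensively.

```ruby
require 'fileutils'
require 'stringio'
require 'tmpdir'

# Hypothetical helper (not part of the commit): stores an uploaded mapping
# file under root/subdir. Only the base name of the client-supplied filename
# is used, so a name containing "../" cannot point outside the folder.
def store_mapping_file(root, subdir, original_filename, io)
  dir = File.join(root, subdir)
  FileUtils.mkdir_p(dir) # creates missing parent directories in one call
  path = File.join(dir, File.basename(original_filename))
  File.binwrite(path, io.read) # binary mode leaves the bytes untouched
  path
end

# Illustrative run against a temporary directory instead of public/uploads.
root = Dir.mktmpdir
upload = StringIO.new("example: mapping\n")
stored = store_mapping_file(root, 'solrmapping', '../../evil.yml', upload)
```

In the controller, `subdir` would correspond to `modsmapping` or `solrmapping`, chosen from the harvest type.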
54 changes: 0 additions & 54 deletions app/controllers/spotlight/resources/oaipmh_harvester_controller.rb

This file was deleted.

2 changes: 1 addition & 1 deletion app/jobs/spotlight/resources/perform_harvests_job.rb
@@ -26,7 +26,7 @@ def perform(harvester:, user: nil)
@exhibit = harvester.exhibit
@set = harvester.set
@user = user
-@sidecar_ids = harvester.harvest_oai_items(job_tracker: job_tracker, job_progress: progress)
+@sidecar_ids = harvester.harvest_items(job_tracker: job_tracker, job_progress: progress)
@total_errors = harvester.total_errors
@total_warnings = 0

7 changes: 7 additions & 0 deletions app/models/spotlight/harvest_type.rb
@@ -0,0 +1,7 @@
module Spotlight
class HarvestType
MODS = "MODS"
SOLR = "Solr"
HARVEST_TYPES = [MODS, SOLR]
end
end
46 changes: 46 additions & 0 deletions app/models/spotlight/harvester.rb
@@ -0,0 +1,46 @@
module Spotlight
class Harvester < ActiveRecord::Base
belongs_to :exhibit

attr_accessor :total_errors

validates :base_url, presence: true
validates :set, presence: true

def self.mapping_files(dir_name)
if (Dir.exist?("public/uploads/#{dir_name}"))
files = Dir.entries("public/uploads/#{dir_name}")
files.delete('.')
files.delete('..')
else
files = Array.new
end

files.insert(0, 'New Mapping File')
files.insert(0, 'Default Mapping File')
files
end

def handle_item_harvest_error(error, parsed_item, job_tracker = nil)
error_msg = parsed_item.id + ' did not index successfully:'
Delayed::Worker.logger.add(Logger::ERROR, error_msg)
Delayed::Worker.logger.add(Logger::ERROR, error.message)
Delayed::Worker.logger.add(Logger::ERROR, error.backtrace)
if job_tracker.present?
job_tracker.append_log_entry(type: :error, exhibit: exhibit, message: error_msg)
job_tracker.append_log_entry(type: :error, exhibit: exhibit, message: error.message)
end
self.total_errors += 1
end

def update_progress_total(job_progress)
job_progress.total = complete_list_size
end

def get_mapping_file
return if mapping_file.eql?('Default Mapping File') || mapping_file.eql?('New Mapping File')

mapping_file
end
end
end
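
To make the behavior of `Harvester.mapping_files` and `#get_mapping_file` above easier to follow, here is a standalone sketch of the same idea: two placeholder options are listed ahead of any uploaded files, and selecting either placeholder means no custom mapping file. The method names `mapping_file_options` and `custom_mapping_file` are hypothetical. Using `Dir.children` and sorting the entries are choices made for this sketch, not what the model does.

```ruby
require 'tmpdir'

# The two placeholder entries the form offers before any uploaded files.
PLACEHOLDERS = ['Default Mapping File', 'New Mapping File'].freeze

# Hypothetical equivalent of Harvester.mapping_files: placeholders first,
# then whatever files exist in the directory (none if it is missing).
def mapping_file_options(dir)
  files = Dir.exist?(dir) ? Dir.children(dir).sort : [] # children omits "." and ".."
  PLACEHOLDERS + files
end

# Hypothetical equivalent of #get_mapping_file: a placeholder selection
# yields nil, anything else is treated as a custom file name.
def custom_mapping_file(selection)
  PLACEHOLDERS.include?(selection) ? nil : selection
end

# Illustrative run against a temporary directory.
dir = Dir.mktmpdir
File.write(File.join(dir, 'b.yml'), '')
File.write(File.join(dir, 'a.yml'), '')
options = mapping_file_options(dir)
```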
43 changes: 5 additions & 38 deletions app/models/spotlight/oaipmh_harvester.rb
@@ -3,29 +3,14 @@
require 'uri'

module Spotlight
-class OaipmhHarvester < ActiveRecord::Base
-belongs_to :exhibit
-
-validates :base_url, presence: true
-validates :set, presence: true
-
-attr_accessor :total_errors
+class OaipmhHarvester < Harvester
+alias_attribute :mapping_file, :mods_mapping_file

def self.mapping_files
-if (Dir.exist?('public/uploads/modsmapping'))
-files = Dir.entries('public/uploads/modsmapping')
-files.delete('.')
-files.delete('..')
-else
-files = Array.new
-end
-
-files.insert(0, 'New Mapping File')
-files.insert(0, 'Default Mapping File')
-files
+super('modsmapping')
end

-def harvest_oai_items(job_tracker: nil, job_progress: nil)
+def harvest_items(job_tracker: nil, job_progress: nil)
self.total_errors = 0
@sidecar_ids = []
harvests = oaipmh_harvests
@@ -94,15 +79,7 @@ def harvest_item(record, job_tracker, job_progress)

job_progress&.increment
rescue Exception => e
-error_msg = parsed_oai_item.id + ' did not index successfully:'
-Delayed::Worker.logger.add(Logger::ERROR, error_msg)
-Delayed::Worker.logger.add(Logger::ERROR, e.message)
-Delayed::Worker.logger.add(Logger::ERROR, e.backtrace)
-if job_tracker.present?
-job_tracker.append_log_entry(type: :error, exhibit: exhibit, message: error_msg)
-job_tracker.append_log_entry(type: :error, exhibit: exhibit, message: e.message)
-end
-self.total_errors += 1
+handle_item_harvest_error(e, parsed_oai_item, job_tracker)
end

def oaipmh_harvests
@@ -131,15 +108,5 @@ def client
def oai_mods_converter
@oai_mods_converter ||= Spotlight::Resources::OaipmhModsConverter.new(set, exhibit.slug, get_mapping_file)
end
-
-def update_progress_total(job_progress)
-job_progress.total = complete_list_size
-end
-
-def get_mapping_file
-return if mapping_file.eql?('Default Mapping File') || mapping_file.eql?('New Mapping File')
-
-mapping_file
-end
end
end
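
The change above makes `OaipmhHarvester` a subclass of `Harvester`, which relies on the `:type` column described in the commit message. With Single Table Inheritance, Active Record stores the class name in that column and instantiates the matching subclass when a row is loaded, which is why existing rows need a value backfilled. The plain-Ruby sketch below imitates that mechanism without Rails. The `Sketch` module, its `instantiate` and `backfill_type` helpers, and the sample rows are all hypothetical stand-ins, not the migration or models from this commit.

```ruby
# Plain-Ruby stand-ins for the models; the real classes are Active Record
# models and the real lookup is done by Rails, not by this code.
module Sketch
  class Harvester
    attr_reader :base_url, :set

    def initialize(base_url:, set:)
      @base_url = base_url
      @set = set
    end
  end

  class OaipmhHarvester < Harvester; end
  class SolrHarvester < Harvester; end

  # Imitates loading a row under STI: the `type` value names the class.
  def self.instantiate(row)
    klass = const_get(row.fetch(:type))
    raise ArgumentError, "#{klass} is not a harvester" unless klass <= Harvester

    klass.new(base_url: row[:base_url], set: row[:set])
  end

  # Imitates a data migration: rows created before the column existed
  # have no type, so they receive a default one.
  def self.backfill_type(rows, default_type)
    rows.map { |row| row[:type] ? row : row.merge(type: default_type) }
  end
end

# Illustrative rows: one legacy row without a type, one Solr row.
rows = [
  { base_url: 'https://example.org/oai', set: 'maps', type: nil },
  { base_url: 'https://example.org/solr/', set: 'posters', type: 'SolrHarvester' }
]

migrated = Sketch.backfill_type(rows, 'OaipmhHarvester')
harvesters = migrated.map { |row| Sketch.instantiate(row) }
```

Every pre-existing harvester in this gem was an OAI-PMH harvester, so backfilling a single default type, as the sketch does, matches the situation the commit message describes.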