API reference

Opening and Closing

Reading

Writing

Syncing

Options and Settings

(viewable under index.options)

Opening and Closing

Initialization

Make sure that search-index is installed with npm, and then do one of the following:

const searchIndex = require('search-index')
searchIndex(options, function(err, si) {
  // si is now a new search index
})

or

require('search-index')(options, function(err, si) {
  // si is now a new search index
})

or

const getData = function(err, si) {
  // si is now a new search index
}
require('search-index')(options, getData)

options can be any of the settings described under Options and Settings. They form the defaults of the index until it is closed or the options are changed.

close(...)

Closes the index and the underlying data store. Generally this happens automatically, so this function rarely needs to be called. It is only needed if you are doing very fast restarts.

index.close(function(err) {
  if (!err) console.log('success!')
})

Reading

availableFields(...)

Returns a readable stream of all fields that can be searched in.

si.availableFields().on('data', function (field) {
  // "field" is the name of a field that is searchable
}).on('end', function () {
  // done
})

buckets(...)

Returns a readable stream of user-defined aggregations. It can be used to generate categories by price, age, etc.

  si.buckets({
    query: [
      {
        AND: {'*': ['*']}
      }
    ],
    buckets: [
      {
        field: 'price',
        gte:   '2',
        lte:   '3',
        set:   false
      },
      {
        field: 'price',
        gte:   '4',
        lte:   '7',
        set:   false
      }
    ]
  }).on('data', function (data) {
    // do something with the data
  }).on('end', function () {
    // finished
  })

query is a standard search-index query that will specify a set of documents

buckets is an array of bucket objects, each of which can have the following properties:

  • field (mandatory) The field name
  • gte "greater than or equal to"
  • limit Limit the entries that will be returned
  • lte "less than or equal to"
  • set If true, return a set of IDs. If false or not set, return a count
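The bucket properties listed above can be read as a range filter followed by either a count or a set of IDs. The sketch below imitates that behaviour on a plain in-memory array. It does not call search-index: fillBucket, the sample documents, and the choice to compare values as strings are all assumptions made for illustration.

```javascript
// Hypothetical helper: NOT part of search-index. It only illustrates how
// gte, lte and set are described as interacting for a single bucket.
const fillBucket = function (docs, bucket) {
  const ids = docs
    .filter(function (doc) {
      const value = String(doc[bucket.field])
      // Assumption: values are compared as strings
      return value >= bucket.gte && value <= bucket.lte
    })
    .map(function (doc) { return doc.id })
  return {
    field: bucket.field,
    gte: bucket.gte,
    lte: bucket.lte,
    // set: true gives the matching IDs, otherwise only how many matched
    value: bucket.set ? ids : ids.length
  }
}

// Made-up sample documents
const docs = [
  { id: 'a', price: '2' },
  { id: 'b', price: '3' },
  { id: 'c', price: '5' }
]

console.log(fillBucket(docs, { field: 'price', gte: '2', lte: '3', set: false }).value)
console.log(fillBucket(docs, { field: 'price', gte: '4', lte: '7', set: true }).value)
```

The first call reports a count for the bucket and the second reports the IDs that fall inside it.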

categorize(...)

Collates documents under all possible values of the given field name, and returns a readable stream.

  si.categorize({
    query: [
      {
        AND: {'*': ['swiss', 'watch']}
      }
    ],
    category: {
      field: 'manufacturer'
    }
  }).on('data', function (data) {
    // do something with the data
  }).on('end', function () {
    // finished
  })

query is a standard search-index query that will specify a set of documents

category is an object specifying the field to categorize on:

  • field Name of the field to categorize on
  • set If true, return a set of IDs. If false or not set, return a count

In addition, the q object can have offset and pageSize, which work the same way as they do for search(...).

countDocs(...)

Returns the total number of documents in the index.

si.countDocs(function (err, count) {
  console.log('this index contains ' + count + ' documents')
})

get(...)

Gets one or more documents from the corpus as a readable stream.

index.get(docIDs).on('data', function (doc) {
  // doc is a document for each ID in docIDs
})
  • docIDs an array of document IDs

match(...)

Use match to create autosuggest and autocomplete functionality. See also here

index.match({
  beginsWith: 'epub'
}).on('data', function (data) {
  // "data" is a match for the given beginsWith string
}).on('end', function () {
  // done
})

options is an object that can contain:

  • beginsWith string default:'' return all words that begin with this string
  • field string default:'*' perform matches on data found in this field
  • threshold number default:3 only perform matches once beginsWith is longer than this number
  • limit number default:10 maximum number of matches to return
  • type string default:'simple' the type of matcher to use, can only be 'simple' for the time being.
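How beginsWith, threshold and limit interact can be sketched on an in-memory word list. This is not the search-index matcher: suggest and the word list are hypothetical, and reading "longer than" as a strict comparison is an assumption, so the exact boundary case may differ in the library.

```javascript
// Hypothetical helper: NOT the search-index implementation. It applies the
// beginsWith, threshold and limit options to an in-memory list of words.
const suggest = function (words, options) {
  const ops = Object.assign({ beginsWith: '', threshold: 3, limit: 10 }, options)
  // Assumption: "longer than" is read strictly, so the prefix needs more
  // characters than threshold before any matching happens
  if (ops.beginsWith.length <= ops.threshold) return []
  return words
    .filter(function (word) { return word.startsWith(ops.beginsWith) })
    .slice(0, ops.limit)
}

// Made-up vocabulary
const words = ['epub', 'epubs', 'epublishing', 'epic']

console.log(suggest(words, { beginsWith: 'epub' }))
console.log(suggest(words, { beginsWith: 'epub', limit: 2 }))
console.log(suggest(words, { beginsWith: 'ep' }))
```

The last call returns nothing because the prefix is shorter than the default threshold.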

search(...)

Searches in the index. See also here

si.search({
  query: [{
    AND: {'*': ['gigantic', 'teddy', 'bears']}
  }]
}).on('data', function (data) {
  // do something cool with search results
})

q is an object that describes a search query and can contain the following properties:

  • query Object An object that specifies query terms and fields. For example {'title': ['ronald', 'reagan']}. An asterisk can be used as a wildcard for either fieldname or terms {'*': ['ronald', 'reagan']} or {'title': ['*']}. If documents have been indexed with an nGram length of 2 or more, it is possible to search for the phrase 'ronald reagan': {'*': ['ronald reagan']}. AND or NOT conditions can be specified, and queries can be chained together to create OR statements:

    [
      {
        AND: {'*':    ['watch', 'gold'] },
        NOT: {'name': ['apple'] }
      },
      {
        AND: {'*':    ['apple', 'watch'] }
      }
    ]

    Find "watch" AND "gold" in all ("*") fields but NOT "apple" in the name field, OR "apple" AND "watch" in all fields

  • filters Object One or more objects added to the AND and/or NOT objects that filter the search results. Filters can only be applied to searchable fields in your index. Filters are commonly used in conjunction with the selection of categories and buckets.

    [
      {
        AND: {
          '*':    ['watch', 'gold'],
          'price': [{
            gte: '1000',
            lte: '8'
          }]
        }
      }
    ]
  • offset number Sets the start index of the results. In a scenario where you want to go to "page 2" in a resultset of 30 results per page, you would set offset to 30. Page 3 would require an offset of 60, and so on.

  • pageSize number Sets the size of the resultset.
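The offset arithmetic described above (page 2 at 30 results per page means an offset of 30) can be wrapped in a small helper. pageToQuery is a hypothetical name, not a search-index function. It only builds the q object that would be passed to search(...).

```javascript
// Hypothetical helper: turns a 1-based page number into the offset and
// pageSize properties described above
const pageToQuery = function (query, page, pageSize) {
  return {
    query: query,
    offset: (page - 1) * pageSize,
    pageSize: pageSize
  }
}

const q = pageToQuery([{ AND: { '*': ['gigantic', 'teddy', 'bears'] } }], 2, 30)
console.log(q.offset) // 30
```

The resulting object has the same shape as the q argument in the search example, with offset and pageSize filled in.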

totalHits(...)

Returns a count of the documents for the given query including those hidden by pagination

si.totalHits(q, function (err, count) {
  console.log('the query ' + JSON.stringify(q) + ' gives ' + count + ' results')
})

Writing

add(...)

Returns a writeable stream that can be used to index documents into the search index.

Note that this stream cannot be used concurrently. If documents are being sent on top of one another, it is safer to use concurrentAdd. However, add is faster and uses fewer resources.

add will heed the following batchOptions:

  • appendOnly boolean: default:false If set to true, documents will not be deleted before being added. This is useful when you are creating an index from scratch and want to speed up indexing time. Normally all documents should have unique IDs, or no IDs. If you add a document with appendOnly: true to an index that already contains a document with that ID, then the new document will simply overwrite the old one without deleting the previous one. In this way advanced users can augment existing documents by setting appendOnly: true.
// s is a Readable stream in object mode
s.pipe(si.defaultPipeline(batchOptions))
  .pipe(si.add())
  .on('finish', function() {
    // complete
  })

concurrentAdd(...)

An alternative to .add(...) that allows adding by passing an array of documents and waiting for a callback. Useful for environments where node streams cannot be constructed (such as browsers).

Note that concurrentAdd queues documents internally, so it should be used in scenarios where unordered documents are being added rapidly from many sources.

  • data is an array of documents
  • batchOptions is an object describing indexing options
mySearchIndex.concurrentAdd(batchOptions, data, function(err) {
  // docs added
})

concurrentDel(...)

An alternative to .del(...) that allows concurrent deletion

Note that concurrentDel queues documents internally, so it should be used in scenarios where documents are being deleted rapidly without waiting for callbacks.

  • documentIDs is an array of document IDs
mySearchIndex.concurrentDel(documentIDs, function(err) {
  // docs deleted
})

defaultPipeline(...)

Prepares a "standard document" (an object where keys become field names, and values become corresponding field values) for indexing. Customised pipeline stages can be inserted before and after processing if required.

options is an object where each key corresponds to a field name in the documents to be indexed and can contain the following settings:

  • compositeField boolean: default:true should a composite (*) field be generated? Setting to false saves space and speeds up indexing, but disables search on all fields
  • defaultFieldOptions Object default options to use for this batch
  • fieldOptions Object overrides defaultFieldOptions, can have the following object values:
    • fieldedSearch boolean, default:true : can searches be carried out on this specific field
    • nGramLength number, default:1 : length of word sequences to be indexed. Use this to capture phrases of more than one word.
    • preserveCase boolean, default:true : preserve the case of the text
    • searchable boolean default:true : is this field searchable?
    • separator regex : A regex in the String.split() format that will be used to tokenize this field
    • sortable boolean default:false : can this field be sorted on? If true, the field is not searchable
    • stopwords Array, default: require('stopword').en An array of stop words. Languages other than English are available.
    • storeable Array specifies which fields to store in index. You may want to index fields that are not shown in results, for example when dealing with synonyms
    • weight number: default:0 this number will be added to the score for the field allowing some fields to count more or less than others.
    • wildcard boolean: default:true Enables (*) search

del(...)

Deletes one or more documents from the corpus

si.del(docIDs, function(err) {
  // docs deleted
})
  • docIDs an array of document IDs referencing documents that are to be deleted.

feed(...)

Returns a writable stream that allows you to add documents to the index.

index.feed(options)

Options is an object that is passed on to defaultPipeline and in addition contains its own parameters

  • objectMode boolean: default:false Specifies whether the stream will accept objects or strings

flush(...)

Empties the index. Deletes everything.

index.flush(function(err) {
  if (!err) console.log('success!')
})

Syncing

dbReadStream(...)

Use dbReadStream() to create a stream of the underlying key-value store. This can be used to pipe indexes around. You can, for example, replicate indexes to file, or to other (empty) indexes.

// replicate an index to file
si.dbReadStream(options)
  .pipe(JSONStream.stringify('', '\n', ''))
  .pipe(fs.createWriteStream('backup.json'))
  .on('close', function() {
    // done
  })
// replicate an index to another search-index
replicator.dbReadStream({gzip: true})
  .pipe(zlib.createGunzip())
  .pipe(JSONStream.parse())
  .pipe(replicatorTarget2.dbWriteStream())
  .on('close', function () {
    // done
  })

options is an optional object that describes how the stream is formatted

  • gzip If set to true, the readstream will be compressed into the gzip format

dbWriteStream(...)

Use dbWriteStream() to read in an index created by dbReadStream().

// read a backup created by dbReadStream() into an index
fs.createReadStream('backup.json')
  .pipe(JSONStream.parse())
  .pipe(si.dbWriteStream(options))
  .on('close', function() {
    // done
  })

options is an optional object that describes how the stream will be written

  • merge If set to true, the writestream will merge this index with the existing one, if set to false the existing index must be empty

Options and Settings

appendOnly

boolean

When adding docs, don't check to see if they already exist.

batchsize

number

Specifies how many documents to process before merging them into the index. When the end of the stream is reached, all remaining documents will be merged, even if batchsize is not reached.
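The batching rule above can be pictured with a plain array: full batches are emitted as they fill up, and whatever is left at the end is emitted as a final, smaller batch. toBatches is a hypothetical stand-in for illustration and is not how search-index is implemented internally.

```javascript
// Hypothetical helper: groups documents into batches of `batchsize`,
// flushing a final partial batch when the input ends
const toBatches = function (docs, batchsize) {
  const batches = []
  let current = []
  docs.forEach(function (doc) {
    current.push(doc)
    if (current.length === batchsize) {
      batches.push(current)
      current = []
    }
  })
  // end of input: flush whatever is left, even if it is a partial batch
  if (current.length > 0) batches.push(current)
  return batches
}

console.log(toBatches(['a', 'b', 'c', 'd', 'e'], 2))
```

With five documents and a batchsize of 2, the last batch holds a single document.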

compositeField

boolean

Allow search across all fields (*)

db

a levelup instance

The datastore.

fieldedSearch

boolean

If true, then the field is searchable.

fieldOptions

object

Contains field specific overrides to global settings

Example on setting options on several fields:

fieldOptions: {
  id: {
    searchable: false
  },
  url: {
    searchable: false
  }
}
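One way to picture "field specific overrides to global settings" is a shallow merge of the per-field entry over the defaults. The sketch below assumes exactly that. resolveFieldOptions and the sample defaults are hypothetical, and the library may resolve options differently.

```javascript
// Hypothetical helper: NOT part of search-index. It shows a per-field
// override being layered on top of a set of default field options.
const resolveFieldOptions = function (fieldName, defaultFieldOptions, fieldOptions) {
  // Later arguments win, so the field-specific entry overrides the defaults
  return Object.assign({}, defaultFieldOptions, fieldOptions[fieldName] || {})
}

// Assumed defaults for illustration only
const defaults = { searchable: true, nGramLength: 1 }
const overrides = {
  id: { searchable: false },
  url: { searchable: false }
}

console.log(resolveFieldOptions('id', defaults, overrides))
console.log(resolveFieldOptions('title', defaults, overrides))
```

Fields without an entry, such as title here, simply keep the defaults.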

preserveCase

boolean

If true, case is preserved. For example: queries for "Potato" will not match "potato"

storeable

boolean

If true, a cache of the field is stored in the index

searchable

boolean

If true, this field will be searchable. If false, the field will be available, but not searchable.

indexPath

string

The location of the datastore. If db is specified, then indexPath is ignored

logLevel

string

A bunyan log level.

nGramLength

number or array or object

All valid definitions of nGramLength:

nGramLength = 1                // 1
nGramLength = [1,3]            // 1 & 3
nGramLength = {gte: 1, lte: 3} // 1, 2 & 3

Specifies how to split strings into phrases. See https://www.npmjs.com/package/term-vector for examples
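The three nGramLength forms above can be normalised to a list of lengths and then applied to a token array. The sketch below is a hypothetical illustration and not the term-vector implementation. Joining tokens with a single space and the order in which n-grams are emitted are assumptions.

```javascript
// Hypothetical helper: expands any of the three nGramLength forms
// (number, array, {gte, lte}) into an explicit list of lengths
const nGramLengths = function (nGramLength) {
  if (typeof nGramLength === 'number') return [nGramLength]
  if (Array.isArray(nGramLength)) return nGramLength.slice()
  const lengths = []
  for (let n = nGramLength.gte; n <= nGramLength.lte; n++) lengths.push(n)
  return lengths
}

// Hypothetical helper: builds word n-grams of every requested length.
// Assumption: tokens in a phrase are joined with a single space.
const nGrams = function (tokens, nGramLength) {
  const grams = []
  nGramLengths(nGramLength).forEach(function (n) {
    for (let i = 0; i + n <= tokens.length; i++) {
      grams.push(tokens.slice(i, i + n).join(' '))
    }
  })
  return grams
}

console.log(nGramLengths({ gte: 1, lte: 3 })) // [ 1, 2, 3 ]
console.log(nGrams(['ronald', 'reagan', 'speech'], { gte: 1, lte: 2 }))
```

With lengths 1 and 2, the phrase 'ronald reagan' appears alongside the single words, which is what makes phrase queries possible.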

separator

string

Specifies how strings are to be split, using a regex in the String.split() format
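As a quick illustration of a separator in the String.split() format, the sketch below tokenizes a string on runs of spaces, commas and hyphens. The regex and the tokenize helper are examples chosen here, not defaults taken from search-index, and dropping empty tokens is an assumption.

```javascript
// Hypothetical helper: splits text with a String.split() style separator
// and drops empty tokens (for example from a trailing separator)
const tokenize = function (text, separator) {
  return text.split(separator).filter(function (token) {
    return token.length > 0
  })
}

// Example separator: one or more spaces, commas or hyphens
console.log(tokenize('gigantic, teddy-bears', /[ ,-]+/))
```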

stopwords

array An array of stopwords

Arrays of stopwords for the following languages are supplied:

  • ar - Modern Standard Arabic
  • bn - Bengali
  • da - Danish
  • de - German
  • en - English
  • es - Spanish
  • fa - Farsi
  • fr - French
  • hi - Hindi
  • it - Italian
  • ja - Japanese*
  • nl - Dutch
  • no - Norwegian
  • pl - Polish
  • pt - Portuguese
  • ru - Russian
  • sv - Swedish
  • zh - Chinese Simplified*

*Some languages like ja Japanese and zh Chinese Simplified have no space between words. For these languages you need to split the text into words before adding it to search-index. You can check out TinySegmenter for Japanese and chinese-tokenizer for Chinese.
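The effect of a stopword array can be sketched as a filter over already-tokenized text. removeStopwords and the four-word list below are hypothetical stand-ins. The real lists come from the stopword module, and this is not how search-index applies them internally.

```javascript
// Hypothetical helper: drops every token that appears in the stopword array
const removeStopwords = function (tokens, stopwords) {
  return tokens.filter(function (token) {
    return stopwords.indexOf(token) === -1
  })
}

// Made-up, very short stand-in for an English stopword list
const en = ['a', 'and', 'the', 'of']

console.log(removeStopwords(['the', 'watch', 'of', 'gold'], en))
```

For languages without spaces between words, the tokens passed in would first have to come from a segmenter, as noted above.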

wildcard

boolean

If true, wildcard search is generated for this field.