Replies: 6 comments
-
hi @drelyea, when uploading a document with the same ID, the resulting operation is equivalent to an Upsert: all the previous information is replaced. For instance, if you upload a PDF with ID "foo" and then upload a Word doc with the same ID "foo", the content of the PDF is replaced with the content of the Word doc. The same applies if you upload multiple files under the same document ID (a document can be composed of multiple files). Perhaps the "Import" name is confusing, but I can assure you it's designed to work this way.
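For example, a minimal sketch of what this looks like with the .NET package (the file names, API key, and `MemoryServerless` setup are placeholders, not something from this thread):

```csharp
using Microsoft.KernelMemory;

// Placeholder setup: any IKernelMemory instance behaves the same way here.
var memory = new KernelMemoryBuilder()
    .WithOpenAIDefaults("<api-key>")
    .Build<MemoryServerless>();

// Importing a second file under the same document ID replaces the first:
// after the second call, only the Word doc's content is associated with "foo".
await memory.ImportDocumentAsync("report.pdf", documentId: "foo");
await memory.ImportDocumentAsync("report.docx", documentId: "foo");
```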
-
Hey @dluc! Thanks for getting back to me. This seems at odds with the behavior I observe, at least with ….

Is this true for both …? As an example, I call …. When I look at my index in Azure Search Service and search for 'frog', I can see 2 distinct entities with matching `__document_id` tags.

Finally, when I call …. And the ….
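A rough sketch of the kind of repro described above (assuming `ImportTextAsync` and `SearchAsync` on an existing `IKernelMemory` instance called `memory`; the text and IDs are placeholders):

```csharp
// Same documentId for both calls: per the upsert semantics described above,
// the second import should replace the first, yet both records remain.
await memory.ImportTextAsync("Frogs are green.", documentId: "frog-facts");
await memory.ImportTextAsync("Frogs are blue.", documentId: "frog-facts");

// Searching the index returns two distinct records, both tagged with
// __document_id = "frog-facts".
var results = await memory.SearchAsync("frog");
Console.WriteLine(results.Results.Count);
```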
-
I believe I may have found the answer after looking into `BaseOrchestrator` - I'm using the …. If the update operation depends on persisted pipeline records between operations, that would absolutely explain the behavior I see. I'll do a little more digging and see if this is the case.
-
thanks for investigating, yes I think you're on the right track. All …. If you need Serverless Memory to be fully persistent: ….

If by any chance you're setting Serverless to use queues, I would avoid using SimpleQueues and opt for Azure Queues or RabbitMQ. Or just don't use queues with Serverless, because it's an odd setup :-)
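For instance, a fully-persistent serverless setup might look roughly like this (a sketch only: the exact builder extension names and config types differ a bit across Kernel Memory versions, and the endpoint/key values are placeholders):

```csharp
using Microsoft.KernelMemory;

var memory = new KernelMemoryBuilder()
    .WithOpenAIDefaults(Environment.GetEnvironmentVariable("OPENAI_API_KEY"))
    // Keep extracted files and pipeline state on disk instead of in volatile
    // memory, so a later import of the same documentId can find the old records.
    .WithSimpleFileStorage(SimpleFileStorageConfig.Persistent)
    // Persistent vector storage (Azure AI Search in this thread's case).
    .WithAzureAISearchMemoryDb("https://<search-name>.search.windows.net", "<api-key>")
    .Build<MemoryServerless>();
```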
-
Appreciate it - I'll look into these options! Looks like I was using a persisted vector storage, but not a persisted content storage. I also verified by an integration test running the two …. Thanks for the help!
-
I am not using serverless. I have an instance of the Kernel Memory service hosted as a container app running the latest Docker container image. When I use the /upload endpoint to store a document, it is 100% duplicating entries in the index instead of upserting. Please advise.
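A minimal way to exercise this against the hosted service from .NET (a sketch; the service URL, file names, and document ID are placeholders, and `MemoryWebClient` should go through the same /upload endpoint):

```csharp
using Microsoft.KernelMemory;

// Points at the hosted Kernel Memory service (container app URL is a placeholder).
var memory = new MemoryWebClient("https://<my-container-app>.azurecontainerapps.io");

// Uploading twice under the same document ID: the expected behavior per this
// thread is an upsert, i.e. the second upload replaces the first.
await memory.ImportDocumentAsync("handbook-v1.pdf", documentId: "handbook");
await memory.ImportDocumentAsync("handbook-v2.pdf", documentId: "handbook");
```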
-
Following off of #85.

I've observed that the `documentId` parameter in the `IKernelMemory.Import<*>Async` methods is actually used as the value for a reserved tag `__document_id` when importing to a vector storage (Azure AI Search in my case). Since it is not used as the primary key for the entity in the index, I can upload multiple pieces of information with the same `documentId`, which count as unique objects in the index grouped together by tag.

I understand the benefit of doing this if you were separately uploading parts of a larger file, but it also means that there is no easy Update mechanism if my intention is to completely replace everything associated with the `documentId` in question. If my source content changes (and potentially has conflicting information with what is already in the index), I would love for deletion of the old information in the index to be part of the SDK. My workaround for this is (in pseudocode):
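(A minimal sketch of that delete-then-reimport idea, assuming the existing `IKernelMemory.DeleteDocumentAsync` and `ImportDocumentAsync` methods and an `IKernelMemory` instance called `memory`; the IDs and file name are placeholders.)

```csharp
// 1. Remove everything currently stored under the document ID, so stale
//    records cannot conflict with the refreshed content.
await memory.DeleteDocumentAsync(documentId: "foo");

// 2. Re-import the updated source under the same document ID.
await memory.ImportDocumentAsync("updated-source.docx", documentId: "foo");
```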
Is there any benefit to adding an `Upsert` or `CreateOrUpdate` operation natively?