Indexing PDFs broken when using Elasticsearch and Azure Media Storage features together #17291

cbadger-montecitobank · 2024-12-31T01:20:12Z

Describe the bug

Indexing PDFs using the Elasticsearch feature with the Azure Media Storage feature appears broken after upgrading to OrchardCore 2.x.

We store our media library files in Azure blob storage, and index the contents of PDFs stored in the media library using the Elasticsearch integration. This worked perfectly fine in OrchardCore 1.x, but after upgrading to 2.x we now get this error:

2024-12-20 16:03:34.3240|||0HN91A5D6FVBI:000000BB|OrchardCore.Contents.Indexing.ContentItemIndexCoordinator|ERR|IContentFieldIndexHandler thrown from OrchardCore.Media.Indexing.MediaFieldIndexHandler by ArgumentException
System.ArgumentException: The provided stream did not support reading.
   at UglyToad.PdfPig.Core.StreamInputBytes..ctor(Stream stream, Boolean shouldDispose)
   at UglyToad.PdfPig.Parser.PdfDocumentFactory.Open(Stream stream, ParsingOptions options)
   at UglyToad.PdfPig.PdfDocument.Open(Stream stream, ParsingOptions options)
   at OrchardCore.Media.Indexing.PdfMediaFileTextProvider.GetTextAsync(String path, Stream fileStream)
   at OrchardCore.Media.Indexing.PdfMediaFileTextProvider.GetTextAsync(String path, Stream fileStream)
   at OrchardCore.Media.Indexing.MediaFieldIndexHandler.BuildIndexAsync(MediaField field, BuildFieldIndexContext context)
   at OrchardCore.Modules.InvokeExtensions.InvokeAsync[TEvents,T1,T2,T3,T4,T5](IEnumerable`1 events, Func`7 dispatch, T1 arg1, T2 arg2, T3 arg3, T4 arg4, T5 arg5, ILogger logger)

This issue seems to be related to a change in PdfMediaFileTextProvider.cs, which now uses a FileStream instead of a MemoryStream to hand off the file data to UglyToad.PdfPig for processing. If I modify the OrchardCore source code to revert back to using a MemoryStream, everything works fine again.

Orchard Core version

2.1.3 (using Nuget packages)

To Reproduce

Enable the Elasticsearch and Azure Media Storage features, and configure them appropriately.
Create a new content item from a content type with a Media field.
Use the media field to pick a PDF from the media library.
Publish the content item, which should trigger an indexing of the PDF content.

Expected behavior

Indexing should work fine, and text from the PDF should show up in the search index.

hishamco · 2024-12-31T06:23:38Z

If this is related to PR #16958, could you please debug and check why the stream does not support reading?

cbadger-montecitobank · 2024-12-31T17:23:43Z

After further debugging, this seems to be the offending line:

seekableStream = new FileStream(Path.GetTempFileName(), FileMode.OpenOrCreate, FileAccess.Write, FileShare.None, 4096, FileOptions.DeleteOnClose);

If I change the parameter FileAccess.Write to FileAccess.ReadWrite then the stream becomes readable and everything works fine again.
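
For anyone who wants to see the difference in isolation, here is a minimal sketch that does not depend on OrchardCore or PdfPig. The helper names `OpenTempStream` and `CopyToSeekableTempStream` are hypothetical and exist only for this illustration, and the placeholder bytes are not a real PDF. It only assumes the constructor call quoted above, and compares `CanRead` for the two access modes.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper (not part of OrchardCore): opens a self-deleting temp file
// with the requested access mode, mirroring the constructor call quoted above.
static FileStream OpenTempStream(FileAccess access) =>
    new FileStream(Path.GetTempFileName(), FileMode.OpenOrCreate, access,
        FileShare.None, 4096, FileOptions.DeleteOnClose);

// Hypothetical helper: buffers a source stream into a temp file and rewinds it,
// so a consumer that needs a readable, seekable stream can start from byte 0.
static FileStream CopyToSeekableTempStream(Stream source)
{
    var temp = OpenTempStream(FileAccess.ReadWrite);
    source.CopyTo(temp);
    temp.Position = 0;
    return temp;
}

// A stream opened with FileAccess.Write cannot be read back.
using (var writeOnly = OpenTempStream(FileAccess.Write))
{
    Console.WriteLine($"Write:     CanRead={writeOnly.CanRead}, CanSeek={writeOnly.CanSeek}");
}

// Placeholder bytes stand in for the blob content; this is not a valid PDF.
using (var source = new MemoryStream(Encoding.UTF8.GetBytes("%PDF-1.7 placeholder bytes")))
using (var seekable = CopyToSeekableTempStream(source))
using (var reader = new StreamReader(seekable))
{
    Console.WriteLine($"ReadWrite: CanRead={seekable.CanRead}, CanSeek={seekable.CanSeek}");
    Console.WriteLine(reader.ReadToEnd());
}
```

With `FileAccess.Write` the stream reports `CanRead` as false, which matches the "did not support reading" exception in the stack trace. With `FileAccess.ReadWrite` the same temp file can be written, rewound, and then read by the consumer.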

hishamco · 2025-01-02T06:33:19Z

Do you plan to submit a PR for this?

Fixes OrchardCMS#17291 by specifying that both Read and Write access are needed on the FileStream used for reading PDF files. This bug was observed when indexing PDFs stored in Azure Blob Storage.

cbadger-montecitobank · 2025-01-02T17:09:32Z

@hishamco I just opened a PR here: #17294

This is my first PR for OrchardCore, so hopefully I executed the steps properly.

cbadger-montecitobank added the bug 🐛 label Dec 31, 2024

This was referenced Jan 1, 2025

Monthly community metrics report for 2024-12-01..2024-12-31 #17293

Open

Monthly community metrics report for 2024-12-01..2024-12-31 iaspnetcore/OrchardCore#4

Open

cbadger-montecitobank mentioned this issue Jan 2, 2025

Fix PDF FileStream reading bug #17294

Merged

sebastienros closed this as completed in c0d252c Jan 2, 2025
