Indexing PDFs broken when using Elasticsearch and Azure Media Storage features together #17291

Closed
cbadger-montecitobank opened this issue Dec 31, 2024 · 4 comments

@cbadger-montecitobank
Contributor

Describe the bug

Indexing PDFs using the Elasticsearch feature with the Azure Media Storage feature appears broken after upgrading to OrchardCore 2.x.

We store our media library files in Azure blob storage, and index the contents of PDFs stored in the media library using the Elasticsearch integration. This worked perfectly fine in OrchardCore 1.x, but after upgrading to 2.x we now get this error:

2024-12-20 16:03:34.3240|||0HN91A5D6FVBI:000000BB|OrchardCore.Contents.Indexing.ContentItemIndexCoordinator|ERR|IContentFieldIndexHandler thrown from OrchardCore.Media.Indexing.MediaFieldIndexHandler by ArgumentException
System.ArgumentException: The provided stream did not support reading.
   at UglyToad.PdfPig.Core.StreamInputBytes..ctor(Stream stream, Boolean shouldDispose)
   at UglyToad.PdfPig.Parser.PdfDocumentFactory.Open(Stream stream, ParsingOptions options)
   at UglyToad.PdfPig.PdfDocument.Open(Stream stream, ParsingOptions options)
   at OrchardCore.Media.Indexing.PdfMediaFileTextProvider.GetTextAsync(String path, Stream fileStream)
   at OrchardCore.Media.Indexing.PdfMediaFileTextProvider.GetTextAsync(String path, Stream fileStream)
   at OrchardCore.Media.Indexing.MediaFieldIndexHandler.BuildIndexAsync(MediaField field, BuildFieldIndexContext context)
   at OrchardCore.Modules.InvokeExtensions.InvokeAsync[TEvents,T1,T2,T3,T4,T5](IEnumerable`1 events, Func`7 dispatch, T1 arg1, T2 arg2, T3 arg3, T4 arg4, T5 arg5, ILogger logger)

This issue seems to be related to a change in PdfMediaFileTextProvider.cs, which now uses a FileStream instead of a MemoryStream to hand off the file data to UglyToad.PdfPig for processing. If I modify the OrchardCore source code to revert to a MemoryStream, everything works fine again.

Orchard Core version

2.1.3 (using Nuget packages)

To Reproduce

  1. Enable the Elasticsearch and Azure Media Storage features, and configure them appropriately.
  2. Create a new content item from a content type with a Media field.
  3. Use the media field to pick a PDF from the media library.
  4. Publish the content item, which should trigger an indexing of the PDF content.

Expected behavior

Indexing should work fine, and text from the PDF should show up in the search index.

@hishamco
Member

If this is related to PR #16958, could you please debug and check why the stream does not support reading?

@cbadger-montecitobank
Contributor Author

After further debugging, this seems to be the offending line:

seekableStream = new FileStream(Path.GetTempFileName(), FileMode.OpenOrCreate, FileAccess.Write, FileShare.None, 4096, FileOptions.DeleteOnClose);

If I change the parameter FileAccess.Write to FileAccess.ReadWrite then the stream becomes readable and everything works fine again.
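
For anyone who wants to confirm this outside of OrchardCore, here is a minimal standalone sketch. It is not the OrchardCore code: the `OpenTemp` helper and the sample payload are hypothetical, and only the `FileStream` constructor arguments are taken from the line quoted above. It relies on the documented behavior that a `FileStream` opened with `FileAccess.Write` reports `CanRead` as false, which would explain why PdfPig rejects the stream.

```csharp
using System;
using System.IO;
using System.Text;

public static class Program
{
    // Hypothetical helper: opens a temp file with the same arguments as the
    // quoted line, varying only the FileAccess value.
    private static FileStream OpenTemp(FileAccess access) =>
        new FileStream(Path.GetTempFileName(), FileMode.OpenOrCreate, access,
            FileShare.None, 4096, FileOptions.DeleteOnClose);

    public static void Main()
    {
        // Write-only: the stream cannot be read back.
        using (var writeOnly = OpenTemp(FileAccess.Write))
        {
            Console.WriteLine($"FileAccess.Write     -> CanRead: {writeOnly.CanRead}");
        }

        // Read/write: the same stream can be filled, rewound, and read.
        using (var readWrite = OpenTemp(FileAccess.ReadWrite))
        {
            Console.WriteLine($"FileAccess.ReadWrite -> CanRead: {readWrite.CanRead}");

            // Placeholder bytes standing in for the copied media file.
            var payload = Encoding.ASCII.GetBytes("%PDF-");
            readWrite.Write(payload, 0, payload.Length);

            // Rewind before handing the stream to a reader.
            readWrite.Position = 0;

            var buffer = new byte[payload.Length];
            var read = readWrite.Read(buffer, 0, buffer.Length);
            Console.WriteLine($"Read back {read} bytes: {Encoding.ASCII.GetString(buffer, 0, read)}");
        }
    }
}
```

Both streams use FileOptions.DeleteOnClose, so the temp files are removed when the using blocks exit.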

@hishamco
Member

hishamco commented Jan 2, 2025

Do you plan to submit a PR for this?

cbadger-montecitobank added a commit to cbadger-montecitobank/OrchardCore that referenced this issue Jan 2, 2025
Fixes OrchardCMS#17291 by specifying that both Read and Write access are needed on the FileStream used for reading PDF files. This bug was observed when indexing PDFs stored in Azure Blob Storage.
@cbadger-montecitobank
Contributor Author

@hishamco I just opened a PR here: #17294

This is my first PR for OrchardCore, so hopefully I executed the steps properly.
