Searching individual pages in solr

Dustin Lebsock
Hi!

I'm looking for some guidance on engineering a solution for searching individual pages of PDF documents. I currently have a SolrCloud setup that uses an external Tika server to extract text from PDFs. I'd like to be able to search both the individual pages and the overall documents themselves (for example, titles that link to an external repo). I'm having trouble coming up with a clean solution.

I ran across a discussion about this on Stack Overflow:
https://stackoverflow.com/a/50160163

I can't really see the pros and cons of indexing a single document with a separate field for each page versus indexing each page separately and using group queries. What does the Solr community recommend?

Thank you for all the help!

Dustin Lebsock

Re: Searching individual pages in solr

Erick Erickson
Well, given the structure of an inverted index, how would you have a clue what page the hit was on? You could conceivably index enough data with payloads and the like, but that’d cause a lot more bloat than just indexing each page.
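
To make the point about the inverted index concrete, here is a deliberately simplified sketch. It is not how Lucene stores postings (it ignores positions, payloads and analysis), and the ids like book1_p2 are made up for illustration. It only shows that a term lookup returns document ids, so the page is recoverable only if the page is the document.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


# Hypothetical two-page book, indexed two different ways.
whole_book = {"book1": "solr basics result grouping"}
per_page = {"book1_p1": "solr basics", "book1_p2": "result grouping"}

# Whole-book indexing: the hit only says "somewhere in book1".
whole_hits = build_inverted_index(whole_book)["grouping"]

# Page-per-document indexing: the hit identifies the page.
page_hits = build_inverted_index(per_page)["grouping"]
```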

Using grouping would allow you to show, say, the top three pages from the books with the highest score on an individual page basis.
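
As a rough sketch of what such a request could look like, the snippet below builds Result Grouping parameters for the select handler. The parameters group, group.field and group.limit are standard Solr ones; the field names doc_id, page_text, title and page_number are assumptions about your schema, and doc_id is assumed to be a single-valued string field.

```python
from urllib.parse import urlencode


def grouped_page_query(user_query, pages_per_doc=3, rows=10):
    """Build select-handler parameters that group page hits by parent document."""
    return {
        "q": user_query,
        "df": "page_text",            # assumed field holding the text of one page
        "group": "true",              # turn on Result Grouping
        "group.field": "doc_id",      # assumed field shared by all pages of one PDF
        "group.limit": pages_per_doc, # pages returned per group
        "rows": rows,                 # with grouping on, this is the number of groups
        "fl": "id,doc_id,title,page_number,score",
    }


params = grouped_page_query("inverted index")
query_string = urlencode(params)  # append to .../select? on your collection
```

In SolrCloud, options such as group.ngroups need all pages of a document on the same shard to return correct counts, so it may be worth routing on the parent document id.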

But there are complications (aren’t there always?). Consider a page with one sentence. Indexed as an individual document, it might score quite high even if not the best choice. Or any embedded illustrations, what do you do with those? Index the caption as part of the text? Ignore the caption? Etc.
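
The one-sentence-page effect comes from length normalization. The sketch below uses the textbook BM25 term-frequency component with k1=1.2 and b=0.75, which I believe are the defaults in Lucene's BM25Similarity. It leaves out IDF and is not the exact formula Lucene evaluates, and the page lengths are invented numbers.

```python
def bm25_tf_component(tf, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Textbook BM25 term-frequency factor with document length normalization."""
    norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    return tf * (k1 + 1) / (tf + norm)


# Same single occurrence of the term, very different page lengths (made-up values).
short_page = bm25_tf_component(tf=1, doc_len=8, avg_doc_len=300)
long_page = bm25_tf_component(tf=1, doc_len=400, avg_doc_len=300)

print(short_page > long_page)
```

The short page wins even though both pages mention the term exactly once, which is the behavior to watch for when pages are indexed as separate documents.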

I’d certainly start with a doc-per-page. Not quite sure what I’d do with the title and such, but that depends on your use-case.
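
A minimal sketch of a doc-per-page layout, assuming you already have the text of each page as a list of strings (how you get that out of Tika is not covered here). All field names are hypothetical, and repeating the title on every page document is just one option for the title question, not a recommendation.

```python
def build_page_docs(doc_id, title, pages):
    """Turn one PDF's per-page text into one Solr document per page."""
    docs = []
    for page_number, text in enumerate(pages, start=1):
        docs.append({
            "id": f"{doc_id}_p{page_number}",  # unique key per page
            "doc_id": doc_id,                  # shared by all pages, usable for grouping
            "title": title,                    # repeated so every hit can show it
            "page_number": page_number,
            "page_text": text,
        })
    return docs


# Hypothetical input: a two-page PDF.
docs = build_page_docs("report-42", "Annual Report", ["First page.", "Second page."])
```

The resulting list can be posted to the collection's update handler as JSON like any other batch of documents.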

Best,
Erick

> On Mar 24, 2020, at 12:22 PM, Dustin Lebsock <[hidden email]> wrote:
> [...]