Regarding pdf indexing issue

Regarding pdf indexing issue

Rahul Prasad Dwivedi
Hello Team,

I am using Solr to index and search PDF documents.

I have gone through the documentation on your website and installed Solr,
but I am unable to index and search the document the way I need.

For example: suppose we have a PDF file that contains a number of
paragraphs, each with its own heading.

If I search for a title in the indexed PDF, the result should contain the
paragraph that the title belongs to.

I am unable to get this to work.

I ran the following command to upload the PDF:

    bin/post -c gettingstarted pdf-sample.pdf

and I am running this command to search:

    curl http://localhost:8983/solr/gettingstarted/select?q='*'

Please suggest anything that might help, and let me know if I am missing
something.

Thanks,

Rahul

Re: Regarding pdf indexing issue

Erick Erickson
Solr will not do this automatically; the Extracting Request Handler
simply indexes the entire contents of the doc without regard to things
like paragraphs. Ditto with HTML. This is actually a task that
requires getting into Tika and using all the bells and whistles there.

I'd recommend two things:

1> Take the PDF parsing offline, i.e. into a separate client. There are
many reasons for this; in particular, it lets you attempt to do what
you're asking. See: https://lucidworks.com/2012/02/14/indexing-with-solrj/

2> Talk to the Tika folks about the best ways to make Tika return the
information in a form that you can index to get what you'd like.
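
As a rough illustration of the "separate client" idea, here is a minimal
Python sketch. It assumes the plain text has already been extracted from the
PDF (for example by Tika), that headings appear as short single lines
separated by blank lines, and that the collection has fields named "heading"
and "body". The heading heuristic, the field names and the function names are
all hypothetical; real PDFs will usually need something smarter.

```python
import json
import re
import urllib.request


def split_sections(text):
    """Split extracted text into sections of one heading plus its paragraphs."""
    blocks = [b.strip() for b in re.split(r"\n\s*\n", text) if b.strip()]
    sections = []
    for block in blocks:
        # Hypothetical heuristic: a heading is a short, single-line block
        # that does not end like a sentence.
        is_heading = (
            "\n" not in block
            and len(block) <= 60
            and not block.endswith((".", "?", "!", ":"))
        )
        if is_heading:
            sections.append({"heading": block, "paragraphs": []})
        elif sections:
            # Text before the first heading is dropped in this sketch.
            sections[-1]["paragraphs"].append(" ".join(block.split()))
    return sections


def build_docs(source_id, text):
    """Build one Solr document per section (field names are assumptions)."""
    return [
        {
            "id": f"{source_id}#{n}",
            "heading": section["heading"],
            "body": " ".join(section["paragraphs"]),
        }
        for n, section in enumerate(split_sections(text))
    ]


def post_docs(docs, update_url):
    """POST the documents as a JSON array to a Solr update URL."""
    request = urllib.request.Request(
        update_url,
        data=json.dumps(docs).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status
```

With one document per section, a search on the heading field returns the
paragraph text stored alongside it. Calling post_docs with something like
http://localhost:8983/solr/gettingstarted/update?commit=true would send the
documents to the collection from the original question.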

Best,
Erick

(Replying to Rahul Prasad Dwivedi's message of Wed, Jul 11, 2018 at 6:35 AM.)

Re: Regarding pdf indexing issue

Walter Underwood
PDF is not a structured document format. It is a printer control format.

PDF does not have a paragraph marker. Instead, it says to move
to this spot on the page, choose this font, and print this letter. For a
paragraph, it moves farther. For the next letter in a word, it moves a
little bit. Extracting paragraphs from that is a difficult pattern recognition
problem.
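
To make the pattern recognition problem concrete, here is a minimal Python
sketch of the kind of guesswork involved. It assumes some extractor has
already produced word fragments as (x, y, text) tuples with y growing down the
page; that representation, the function name and the thresholds are
assumptions for illustration, not any library's API.

```python
def group_paragraphs(fragments, line_gap=14.0, tolerance=2.0):
    """Guess paragraphs from positioned fragments given as (x, y, text)."""
    # Step 1: fragments at (nearly) the same y are treated as one line.
    lines = []  # each entry is [y, [(x, text), ...]]
    for x, y, text in sorted(fragments, key=lambda f: (f[1], f[0])):
        if lines and abs(y - lines[-1][0]) <= tolerance:
            lines[-1][1].append((x, text))
        else:
            lines.append([y, [(x, text)]])

    # Step 2: a vertical jump clearly larger than the usual line spacing
    # is taken as a paragraph break.
    paragraphs, current, previous_y = [], [], None
    for y, parts in lines:
        line_text = " ".join(text for _, text in sorted(parts))
        if previous_y is not None and y - previous_y > line_gap * 1.5:
            paragraphs.append(" ".join(current))
            current = []
        current.append(line_text)
        previous_y = y
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs
```

Every threshold here is a guess that depends on font size and layout, which
is why this works on one document and fails on the next.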

I worked with a PDF of a two-column magazine article that printed
the first line of column 1, then the first line of column 2, then the
second line of column 1, and so on. If a line ended with a hyphenated
word, too bad.
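
A minimal Python sketch of undoing that kind of interleaving, assuming each
emitted line comes with its (x, y) position and that the x coordinate of the
gap between the columns is known. Both are assumptions; in practice the
column boundary has to be detected too, and the hyphen handling below would
wrongly merge genuinely hyphenated words.

```python
def read_columns(lines, column_boundary):
    """Return two-column text in reading order from (x, y, text) lines."""
    # Split lines into columns by x, then read each column top to bottom.
    left = sorted((l for l in lines if l[0] < column_boundary), key=lambda l: l[1])
    right = sorted((l for l in lines if l[0] >= column_boundary), key=lambda l: l[1])

    text = ""
    for _, _, line in left + right:
        if text.endswith("-"):
            # Naive rejoin of a word hyphenated across a line break.
            text = text[:-1] + line
        elif text:
            text += " " + line
        else:
            text = line
    return text
```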

Extracting structure from a PDF document is somewhere between
very hard and impossible. Someone I worked with said that getting
structured text from PDF was like turning hamburger back into a cow.

Since Acrobat 5, there is “tagged PDF”. I’m not sure how widely that
is used. It appears to be an accessibility feature, so it still might not
be useful for search.

wunder
Walter Underwood
http://observer.wunderwood.org/  (my blog)

(Replying to Erick Erickson's message of Jul 11, 2018, 8:07 AM.)

Re: Regarding pdf indexing issue

Shamik Sinha
You could try the Tesseract tool to check how well data can be extracted
from the PDF or from images, and then proceed accordingly. As far as I
understand, a scanned PDF is an image and not data. A searchable PDF
overlays the selectable text as hidden text over the PDF image. Such PDFs
can be indexed and their text extracted. This is mostly supported for
English and other Latin-derived languages; you may have problems extracting
or indexing text in any other language. Handwritten text converted to PDF is
next to impossible to index or extract. Apache Tika may be the solution you
are looking for.
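
One way to act on this is to check whether a page yielded any usable text and
fall back to OCR only when it did not. The Python sketch below assumes the
text has already been extracted per page and that the page is also available
as an image file; the function names and the 20-character threshold are
hypothetical. The command it builds uses Tesseract's command-line form
`tesseract <image> stdout -l <language>`.

```python
def has_text_layer(extracted_text, min_chars=20):
    """Treat a page as searchable if extraction yielded enough letters/digits."""
    return sum(ch.isalnum() for ch in extracted_text) >= min_chars


def tesseract_command(image_path, language="eng"):
    """Build the command that prints the recognized text to standard output."""
    return ["tesseract", image_path, "stdout", "-l", language]


def plan_extraction(extracted_text, image_path, language="eng"):
    """Return None if the text layer is usable, else the OCR command to run."""
    if has_text_layer(extracted_text):
        return None
    return tesseract_command(image_path, language)
```

The returned list can be passed to subprocess.run on a machine where
Tesseract and the relevant language data are installed.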

(Replying to Walter Underwood's message of Wed 11 Jul, 2018, 9:12 PM.)

Re: Regarding pdf indexing issue

Terry Steichen
In reply to this post by Walter Underwood
Walter,

Well said.  (And I love the hamburger conversion analogy - very apt.)

The only thing I will add is that when you have a collection of similar
rich text documents, you might be able to construct queries to respect
internal structures within the documents.  If all/most of your documents
have a unique line like "subject:", you might be able to be selective.

Also, if your documents are organized on disk in some categorical way,
you can include in your query a reference to that categorical
information (via the id field, e.g. id:*pattern*).

Finally, there *might* be useful information in the metadata that you
can use in refining your searches.
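
A minimal Python sketch of building such a query, assuming the document id
contains the file path (so a directory name can serve as the pattern) and
using the collection URL from the original question. The function name and
the example values are hypothetical.

```python
from urllib.parse import urlencode


def select_url(base_url, text, path_pattern=None, rows=10):
    """Build a Solr select URL, optionally restricted by an id wildcard."""
    params = {"q": text, "rows": rows}
    if path_pattern:
        # Filter on ids that contain the pattern, e.g. a directory name.
        params["fq"] = f"id:*{path_pattern}*"
    return f"{base_url}/select?{urlencode(params)}"


url = select_url(
    "http://localhost:8983/solr/gettingstarted",
    '"subject: pricing"',       # phrase built around a line most docs share
    path_pattern="contracts",   # hypothetical directory name on disk
)
print(url)
```

Leading-wildcard patterns like this can be slow on large indexes, so a
dedicated category field filled in at index time may be the better long-term
choice.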

Terry


(Replying to Walter Underwood's message of 07/11/2018, 11:42 AM.)