Parsing order issue

Lu Sun

Dear Tika Dev Team,

I hope this email finds you well.

I have been actively using Tika to read PDF files. One issue I have found is the parsing order: as shown in the attached image, the text of the PDF file is not extracted in the order of its position on the page.

As suggested in this GitHub issue <https://github.com/chrismattmann/tika-python/issues/266>, I used a customized config file (see attached), hoping to solve the issue, but this has not worked. Could you please review this issue and provide any insights or solutions?

Thanks so much in advance.

Regards,

Luke


Attachment: tika.config (918 bytes)

Re: Parsing order issue

Tilman Hausherr
Please upload the PDF to a file-hosting service.

Tilman

On 15.12.2019 at 23:21, Lu Sun wrote:

>
> Dear Tika Dev Team,
>
> Hope this email finds you well.
>
> I have been actively using Tika for pdf file reading. One issue I
> found is the parsing order. As shown in attached image, the parsing
> order of pdf file is not  based on position of texts.
>
> As suggested in this github link
> <https://github.com/chrismattmann/tika-python/issues/266>, I used a
> customized config file (see attached), hoping to solve the issue. But
> this has not worked out. If any chance, can you please review this
> issue, and provide any insights or solutions?
>
> Thanks so much in advance.
>
> Regards,
>
> Luke
>

Re: Parsing order issue

Tim Allison
In reply to this post by Lu Sun
PDFBox Colleagues,
  Any recommendations?

Re: Parsing order issue

Maruan Sahyoun
Hi Tim,

Unfortunately, the image didn't make it to the mailing list. What is the issue here? Is the extracted text not in the right order?

The order in which text is parsed from a PDF and the visual order of the text on the page are not related.

BR
Maruan
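
To illustrate the point about parse order versus visual order, here is a minimal Java sketch. It assumes each piece of text comes with the coordinates of where it is drawn, and it reorders the pieces top-to-bottom and then left-to-right. The class `PositionSortSketch`, the `Chunk` type, and the sample coordinates are all hypothetical and exist only for this illustration; this is not PDFBox's or Tika's actual sorting code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class PositionSortSketch {

    /** A piece of text plus the point where it is drawn (hypothetical type). */
    public static final class Chunk {
        final String text;
        final double x; // distance from the left edge of the page
        final double y; // distance from the bottom edge (y grows upwards)

        public Chunk(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Returns the chunk texts top-to-bottom, then left-to-right; the input is not modified. */
    public static List<String> inVisualOrder(List<Chunk> streamOrder) {
        List<Chunk> sorted = new ArrayList<>(streamOrder);
        sorted.sort(Comparator
                .comparingDouble((Chunk c) -> -c.y)  // larger y is higher on the page
                .thenComparingDouble(c -> c.x));     // same height: left to right
        List<String> texts = new ArrayList<>();
        for (Chunk c : sorted) {
            texts.add(c.text);
        }
        return texts;
    }

    public static void main(String[] args) {
        // The order in which a PDF producer happened to write the text into the page.
        List<Chunk> streamOrder = Arrays.asList(
                new Chunk("Footer", 50, 40),
                new Chunk("Title", 50, 750),
                new Chunk("right column", 300, 700),
                new Chunk("left column", 50, 700));

        System.out.println(inVisualOrder(streamOrder));
    }
}
```

Running `main` prints `[Title, left column, right column, Footer]`, although the chunks were stored with the footer first. A real implementation has to handle cases this sketch ignores, such as text on one line with slightly different baselines, or rotated text.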

 

Re: Parsing order issue

Tilman Hausherr
In reply to this post by Tim Allison
I already answered... we need the PDF.

But... about the config:

<?xml version="1.0" encoding="UTF-8"?>
<properties>
   <parsers>
     <!-- Default Parser for most things, except for 2 mime types, and never
          use the Executable Parser -->
     <parser class="org.apache.tika.parser.DefaultParser">
       <mime-exclude>image/jpeg</mime-exclude>
       <mime-exclude>application/pdf</mime-exclude>
       <parser-exclude class="org.apache.tika.parser.executable.ExecutableParser"/>
     </parser>

     <!-- Use a different parser for PDF -->
     <parser class="org.apache.tika.parser.DefaultParser">
       <property name="sortByPosition" value="true"/>
       <mime>application/pdf</mime>
     </parser>
   </parsers>
</properties>

Is this a correct setting for PDFs in Tika? I notice that the same
parser class is used twice.

Also, the file was named "tika.config"; shouldn't it be named
"tika-config.xml"?

Tilman

On 17.12.2019 at 13:33, Tim Allison wrote:

> PDFBox Colleagues,
>    Any recommendations?

Re: Parsing order issue

Tim Allison
Tilman,
   That isn’t correct. I’ll find the link that might help...

On Tue, Dec 17, 2019 at 1:02 PM Tilman Hausherr <[hidden email]>
wrote:

> Is this a correct setting for PDFs in tika? I notice that the same
> parser class is used twice.

Re: Parsing order issue

Lu Sun
In reply to this post by Tim Allison
Dear PDFBox Dev Team, 

Hope this message finds you well. 

I just wanted to bring this to your attention again. Could you please suggest a solution to the parsing order issue? Attached are my config file, an example PDF file, and my parsing results.

Thanks so much in advance. Wishing you and your team a Merry Christmas and a Happy New Year.

Regards, 
Luke

Attachment: tika.config (918 bytes)

Re: Parsing order issue

Tilman Hausherr
I already answered: I asked to have a look at your file (please upload it
to a file-hosting service) and mentioned that your config file looks suspicious.

Tilman

On 20.12.2019 at 19:06, Lu Sun wrote:

> Dear PDFBox Dev Team,
>
> Hope this message finds you well.
>
> Just wanted to raise this for your attention. Please can you provide
> any solutions on the parsing order issue? Attached is my config file,
> an example of pdf file and my parsing results.
>
> Thanks so much in advance. Wish you and your team a Merry Christmas
> and Happy New Year.
>
> Regards,
> Luke

Re: Parsing order issue

Lu Sun
In reply to this post by Lu Sun
Dear PDFBox Dev Team,

After searching online
<https://stackoverflow.com/search?page=5&tab=Relevance&q=pdfbox%20order>, I
am certain that using setSortByPosition(true) would help. However, I am
struggling to get the config file right. Could you please give me some
advice on it?

Thanks so much in advance. Regards, Luke

Re: Parsing order issue

Tilman Hausherr
As I understand it, to use sortByPosition in Tika you need a segment
like this:

...
         <parser class="org.apache.tika.parser.pdf.PDFParser">
             <params>
                 <param name="sortByPosition" type="bool">true</param>
             </params>
         </parser>
...

so your whole file would look like this:

<?xml version="1.0" encoding="UTF-8"?>
<properties>
   <parsers>
     <!-- Default Parser for most things, except for PDF -->
     <parser class="org.apache.tika.parser.DefaultParser">
       <mime-exclude>application/pdf</mime-exclude>
     </parser>
     <!-- Use the PDF parser, with position sorting enabled, for PDF -->
     <parser class="org.apache.tika.parser.pdf.PDFParser">
       <mime>application/pdf</mime>
       <params>
         <param name="sortByPosition" type="bool">true</param>
       </params>
     </parser>
   </parsers>
</properties>


I just tried this file with tika-app. With the default configuration the
text was not sorted; with this config it was. I added "--config=config.xml"
on the command line.

Tilman
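
The structural difference between the two config files in this thread can be checked mechanically. The following is a minimal Java sketch, using only the JDK's XPath API, that reports whether a config declares `sortByPosition` as a `<param>` on the `org.apache.tika.parser.pdf.PDFParser` entry. The class and method names are hypothetical, and the check only looks at the shape of the XML; it is not how Tika itself loads its configuration.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

public class ConfigCheckSketch {

    // Selects the sortByPosition param nested under the PDFParser entry.
    private static final String SORT_PARAM =
            "/properties/parsers/parser[@class='org.apache.tika.parser.pdf.PDFParser']"
            + "/params/param[@name='sortByPosition']";

    /** Returns true if the config text sets sortByPosition to true on PDFParser. */
    public static boolean enablesSortByPosition(String configXml) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            // evaluate() returns an empty string when no node matches.
            String value = xpath.evaluate(SORT_PARAM,
                    new InputSource(new StringReader(configXml)));
            return "true".equals(value.trim());
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Config is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        // Shape of the config that sorted the text in the test above.
        String working = "<properties><parsers>"
                + "<parser class='org.apache.tika.parser.DefaultParser'>"
                + "<mime-exclude>application/pdf</mime-exclude></parser>"
                + "<parser class='org.apache.tika.parser.pdf.PDFParser'>"
                + "<mime>application/pdf</mime>"
                + "<params><param name='sortByPosition' type='bool'>true</param></params>"
                + "</parser></parsers></properties>";

        // Shape of the original attachment: DefaultParser with a <property> element.
        String original = "<properties><parsers>"
                + "<parser class='org.apache.tika.parser.DefaultParser'>"
                + "<property name='sortByPosition' value='true'/>"
                + "<mime>application/pdf</mime>"
                + "</parser></parsers></properties>";

        System.out.println(enablesSortByPosition(working));  // true
        System.out.println(enablesSortByPosition(original)); // false
    }
}
```

The second result is `false` because the original file puts the setting in a `<property>` element on `DefaultParser`, which the XPath expression (like the working config) does not use.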

On 07.01.2020 at 00:04, Lu Sun wrote:

> Dear PDFBox Dev Team,
>
> After searching through online
> <https://stackoverflow.com/search?page=5&tab=Relevance&q=pdfbox%20order>, I
> am certain that using setSortByPosition(true) would help. However, I am
> struggling to get the config file right. Can you please provide any advice
> on it?
>
> Thanks so much in advance. Regards, Luke