Re: [EXTERNAL] Extracting font information from xml

Chris Mattmann
Hi Jay, yes, I believe so. Tika Python is just a thin client to Tika Server, and it
provides this functionality. CC’ing dev@tika.

From: Jay Chuk <[hidden email]>
Date: Tuesday, October 15, 2019 at 3:47 PM
To: "Mattmann, Chris A (US 1761)" <[hidden email]>
Subject: [EXTERNAL] Extracting font information from xml

Hi Chris,

Thanks for providing the Python package, Tika, for extracting text from PDFs.

I'd like to confirm whether it is possible, when converting PDF to XML, to get the font style for the text, e.g. the font type and whether the text is bold/solid.

I need this information to identify section headers and titles in the documents.

Please let me know if it is possible, or if there is another way to go about this.

Thank you

Jay


Re: [EXTERNAL] Extracting font information from xml

Jay Chuk
Thanks for the quick reply, Chris.
Is there a Python code snippet for it, please?

Regards,
Jay


Re: [EXTERNAL] Extracting font information from xml

Chris Mattmann
When you do a parse, do this:

from tika import parser

# xmlContent=True requests XHTML output instead of plain text
parsed = parser.from_file('/path/to/file', xmlContent=True)
xmlContent = parsed["content"]
print(xmlContent)

 

G’luck!

Cheers
Chris


Re: [EXTERNAL] Extracting font information from xml

Jay Chuk
Thanks, Chris.
I did that already, but the tags, such as the paragraph tags, contain no
information on the font size or the type of font used.

The output only contains the text.

Regards,
Jay
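
One way to check the observation above is to list which attributes each tag in the returned XHTML actually carries. The following is a minimal sketch using only the Python standard library; the helper name attributes_by_tag and the sample markup are illustrative assumptions, and the sample is hand-written rather than real Tika output.

```python
from collections import defaultdict
from html.parser import HTMLParser


class TagAttributeCollector(HTMLParser):
    """Record which attribute names appear on each tag name."""

    def __init__(self):
        super().__init__()
        self.attributes = defaultdict(set)

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; keep only the names.
        self.attributes[tag].update(name for name, _ in attrs)


def attributes_by_tag(xhtml):
    """Hypothetical helper: map each tag name to its sorted attribute names."""
    collector = TagAttributeCollector()
    collector.feed(xhtml)
    collector.close()
    return {tag: sorted(names) for tag, names in collector.attributes.items()}


# Hand-written sample for illustration only, not real Tika output.
sample = (
    '<html><body><div class="page">'
    "<p>Section 1</p><p>Body text</p>"
    "</div></body></html>"
)
print(attributes_by_tag(sample))
```

With tika-python, the same helper could be called on parsed["content"] from a from_file(..., xmlContent=True) call. If no tag reports a style or font attribute, the font details are not in the output.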


Re: [EXTERNAL] Extracting font information from xml

Tim Allison
We aren’t currently including font information in the output for PDFs. I _think_ it
wouldn’t be too hard to add as <span.../> elements.
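
Since the original question also asked about another way to go about this, here is a rough sketch of one possible alternative. It does not use Tika: it assumes the separate pdfminer.six library is installed, which exposes a font name and size on each character. The function names iter_line_fonts and heading_candidates, and the heuristic itself (bold font name, or size above the most common size), are illustrative assumptions, not something proposed in this thread.

```python
from collections import Counter


def iter_line_fonts(pdf_path):
    """Yield (text, fontname, size) per text line, using pdfminer.six (not Tika)."""
    # Imported lazily so the heuristic below works without pdfminer.six installed.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTChar, LTTextContainer, LTTextLine

    for page in extract_pages(pdf_path):
        for element in page:
            if not isinstance(element, LTTextContainer):
                continue
            for line in element:
                if not isinstance(line, LTTextLine):
                    continue
                chars = [c for c in line if isinstance(c, LTChar)]
                if chars:
                    # Assumption: the first character's font represents the line.
                    first = chars[0]
                    yield line.get_text().strip(), first.fontname, round(first.size, 1)


def heading_candidates(lines):
    """Return texts whose font looks bold or is larger than the most common size."""
    lines = list(lines)
    if not lines:
        return []
    # Treat the most frequent font size as the body-text size.
    body_size = Counter(size for _, _, size in lines).most_common(1)[0][0]
    return [
        text
        for text, fontname, size in lines
        if text and ("bold" in fontname.lower() or size > body_size)
    ]
```

A call such as heading_candidates(iter_line_fonts('/path/to/file')) would then list the lines that stand out from the body text. Font names vary between PDFs, so the "bold" substring check is only a starting point.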
