HTML to PDF conversion

HTML to PDF conversion

Sergey Beryozkin
Hi All

I've seen a Quarkus user asking how to convert HTML to PDF, and one of my
colleagues pointed to
http://www.allcolor.org/YaHPConverter/doc/org/allcolor/yahp/converter/IHtmlToPdfTransformer.html

Does it make sense for Tika to offer something for text-to-PDF conversion
(for a start, something on top of that transformer), and then maybe even
for other formats?

Sergey

Re: HTML to PDF conversion

Sergey Beryozkin
Or on top of PDFBox?

On Mon, Oct 14, 2019 at 12:38 PM Sergey Beryozkin <[hidden email]>
wrote:

> Hi All
>
> I've seen a Quarkus user asking how to convert to PDF, and one of my
> colleagues pointed to
>
> http://www.allcolor.org/YaHPConverter/doc/org/allcolor/yahp/converter/IHtmlToPdfTransformer.html
>
> Does it make sense for Tika to offer something related to the text to PDF
> (for a start, something on top of that transformer), and then may be even
> for other formats ?
>
> Sergey
>

Re: HTML to PDF conversion

Ken Krugler
In reply to this post by Sergey Beryozkin
If you’re suggesting ways to make it easier to use something like YaHPConverter with Tika, definitely yes.

If you’re talking about integrating this functionality…my personal view is no.

I think Tika should focus on extracting content from documents, versus format transformations.

Tika is an attractive location for functionality like this, since it sits in the middle of a lot of data processing pipelines, but I worry about a bloated code base, with corresponding challenges in maintenance and support.

Regards,

— Ken


> On Oct 14, 2019, at 4:38 AM, Sergey Beryozkin <[hidden email]> wrote:
>
> Hi All
>
> I've seen a Quarkus user asking how to convert to PDF, and one of my
> colleagues pointed to
> http://www.allcolor.org/YaHPConverter/doc/org/allcolor/yahp/converter/IHtmlToPdfTransformer.html
>
> Does it make sense for Tika to offer something related to the text to PDF
> (for a start, something on top of that transformer), and then may be even
> for other formats ?
>
> Sergey

--------------------------
Ken Krugler
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr


Re: HTML to PDF conversion

Simone Tripodi-2
In reply to this post by Sergey Beryozkin
Hi Sergey,

even if it is a little outdated, I would like to point to an old article [1]
I worked on with another long-time ASF member, Christian Grobmeier,
about an efficient pipeline for PDF generation using Apache Cocoon 3
and Apache FOP.

In your case your pipeline would be HTML -> HTML Tidy -> FOP -> PDF
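
For illustration only (this is not taken from the article): FOP renders XSL-FO rather than HTML, so the tidied XHTML would first go through an XSLT step. Below is a minimal sketch of that step using only the JDK's built-in XSLT support. The class name XhtmlToFo and the tiny stylesheet are assumptions made up for this example, the input is assumed to be well-formed XHTML without a namespace or DOCTYPE, and the final rendering of the XSL-FO output to PDF by FOP is not shown.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper for the "XHTML -> XSL-FO" step; not part of Cocoon, FOP or Tika.
class XhtmlToFo {

    // Deliberately tiny stylesheet: only <h1> and <p> directly under <body> are mapped.
    static final String XSLT =
          "<xsl:stylesheet version='1.0'"
        + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'"
        + " xmlns:fo='http://www.w3.org/1999/XSL/Format'>"
        + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
        + "<xsl:template match='/html'>"
        + "<fo:root>"
        + "<fo:layout-master-set>"
        + "<fo:simple-page-master master-name='page'"
        + " page-height='297mm' page-width='210mm' margin='20mm'>"
        + "<fo:region-body/>"
        + "</fo:simple-page-master>"
        + "</fo:layout-master-set>"
        + "<fo:page-sequence master-reference='page'>"
        + "<fo:flow flow-name='xsl-region-body'>"
        + "<xsl:apply-templates select='body/h1 | body/p'/>"
        + "</fo:flow>"
        + "</fo:page-sequence>"
        + "</fo:root>"
        + "</xsl:template>"
        + "<xsl:template match='h1'>"
        + "<fo:block font-size='18pt' font-weight='bold'><xsl:value-of select='.'/></fo:block>"
        + "</xsl:template>"
        + "<xsl:template match='p'>"
        + "<fo:block><xsl:value-of select='.'/></fo:block>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    // Applies the stylesheet to an XHTML string and returns the XSL-FO document as a string.
    static String toFo(String xhtml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(XSLT)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xhtml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("XHTML to XSL-FO transformation failed", e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<html><body><h1>Introduction to Tika</h1><p>Hello PDF</p></body></html>";
        System.out.println(toFo(xhtml));
    }
}
```

The resulting XSL-FO string is what would be handed to FOP for the last stage of the pipeline; a real stylesheet would need to cover far more of HTML than these two elements.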

HTH!
Best,
~Simo

[1] https://grobmeier.solutions/create-pdf-cocoon-3-struts-2-15112011.html

http://people.apache.org/~simonetripodi/
http://www.99soft.org/

On Mon, Oct 14, 2019 at 1:39 PM Sergey Beryozkin <[hidden email]> wrote:

>
> Hi All
>
> I've seen a Quarkus user asking how to convert to PDF, and one of my
> colleagues pointed to
> http://www.allcolor.org/YaHPConverter/doc/org/allcolor/yahp/converter/IHtmlToPdfTransformer.html
>
> Does it make sense for Tika to offer something related to the text to PDF
> (for a start, something on top of that transformer), and then may be even
> for other formats ?
>
> Sergey

Re: HTML to PDF conversion

Tilman Hausherr
In reply to this post by Sergey Beryozkin
On 14.10.2019 at 13:39, Sergey Beryozkin wrote:
> or on top of PDFBox ?


This project on top of PDFBox converts HTML to PDF:

https://github.com/danfickle/openhtmltopdf


Tilman





Re: HTML to PDF conversion

Sergey Beryozkin
Hi All

Thanks for the comments.
Simone, Tilman, thanks for the links :-) I've shared them with our users [1].

Cheers, Sergey

[1]
https://quarkusio.zulipchat.com/#narrow/stream/187030-users/topic/Generate.20pdf.20endpoint

On Mon, Oct 14, 2019 at 5:57 PM Tilman Hausherr <[hidden email]>
wrote:

> On 14.10.2019 at 13:39, Sergey Beryozkin wrote:
> > or on top of PDFBox ?
>
>
> This project on top of PDFBox converts HTML to PDF:
>
> https://github.com/danfickle/openhtmltopdf
>
>
> Tilman

Re: HTML to PDF conversion

Sergey Beryozkin
In reply to this post by Ken Krugler
Ken, thanks for the feedback; I meant to reply to your comments.

I suppose I really meant Tika offering a uniform API to create some simple
structured files (PDF, etc.):
ContentCreator creator = ContentCreator.get("PDF");
creator.addTitle("Introduction to Tika");
creator.addText("");
creator.addTable("tablename", new LinkedHashMap<String, List<String>>());
creator.addAttachment(someImage);
creator.complete();

It would be consistent with the Tika approach on the read side.
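
A minimal sketch of how such an interface might be declared, assuming the method names from the calls above. Everything here is hypothetical: Tika has no ContentCreator, the byte[] parameter of addAttachment() and the byte[] return type of complete() are assumptions, and a plain-text creator (registered as "TXT") stands in for a real PDF implementation so the example runs on the JDK alone.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical write-side interface; the method names follow the snippet in this message.
interface ContentCreator {
    void addTitle(String title);
    void addText(String text);
    void addTable(String name, Map<String, List<String>> columns); // column header -> cell values
    void addAttachment(byte[] data);                                // assumption: raw bytes
    byte[] complete();                                              // assumption: finished document as bytes

    // Only "TXT" is wired up in this sketch; a real module would map "PDF" to a PDFBox-backed creator.
    static ContentCreator get(String format) {
        if ("TXT".equalsIgnoreCase(format)) {
            return new PlainTextCreator();
        }
        throw new IllegalArgumentException("No creator available for format: " + format);
    }
}

// Stand-in implementation that renders everything as plain text.
class PlainTextCreator implements ContentCreator {
    private final StringBuilder out = new StringBuilder();
    private int attachments;

    @Override
    public void addTitle(String title) {
        out.append(title).append("\n\n");
    }

    @Override
    public void addText(String text) {
        out.append(text).append("\n\n");
    }

    @Override
    public void addTable(String name, Map<String, List<String>> columns) {
        out.append("Table: ").append(name).append('\n');
        out.append(String.join(" | ", columns.keySet())).append('\n');
        // Columns may have different lengths; missing cells are rendered as empty strings.
        int rows = columns.values().stream().mapToInt(List::size).max().orElse(0);
        for (int r = 0; r < rows; r++) {
            List<String> cells = new ArrayList<>();
            for (List<String> column : columns.values()) {
                cells.add(r < column.size() ? column.get(r) : "");
            }
            out.append(String.join(" | ", cells)).append('\n');
        }
        out.append('\n');
    }

    @Override
    public void addAttachment(byte[] data) {
        attachments++; // plain text cannot embed binary data, so attachments are only counted
    }

    @Override
    public byte[] complete() {
        if (attachments > 0) {
            out.append("[attachments skipped: ").append(attachments).append("]\n");
        }
        return out.toString().getBytes(StandardCharsets.UTF_8);
    }
}

class ContentCreatorDemo {
    public static void main(String[] args) {
        ContentCreator creator = ContentCreator.get("TXT");
        creator.addTitle("Introduction to Tika");
        creator.addText("Tika detects and extracts content from many formats.");
        Map<String, List<String>> table = new LinkedHashMap<>();
        table.put("Format", Arrays.asList("PDF", "ODT"));
        table.put("Library", Arrays.asList("PDFBox"));
        creator.addTable("formats", table);
        System.out.print(new String(creator.complete(), StandardCharsets.UTF_8));
    }
}
```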

Cheers, Sergey
On Mon, Oct 14, 2019 at 4:13 PM Ken Krugler <[hidden email]> wrote:

> If you’re suggesting ways to make it easier to use something like
> YaHPConverter with Tika, definitely yes.
>
> If you’re talking about integrating this functionality…my personal view is
> no.
>
> I think Tika should focus on extracting content from documents, versus
> format transformations.
>
> Tika is an attractive location for functionality like this, since it sits
> in the middle of a lot of data processing pipelines, but I worry about a
> bloated code base, with corresponding challenges in maintenance and support.
>
> Regards,
>
> — Ken

Re: HTML to PDF conversion

Dave Fisher-2
Hi -

You may want to take a look at Apache FOP, which is part of the Apache XML Graphics project. My team had success with it in generating PDF from XML.

Regards,
Dave

> On Oct 16, 2019, at 5:05 AM, Sergey Beryozkin <[hidden email]> wrote:
>
> Ken, thanks for the feedback, I meant to reply to your comments,
>
> I suppose I really meant Tika offering a uniform API to create some simple
> structured PDF/etc files.
> ContentCreator creator = ContentCreator.get("PDF");
> creator.addTitle("Introduction to Tika");
> creator.addText("");
> creator.addTable("tablename", new LinkedHashMap<String, List<String>>());
> creator.addAttachment(someImage);
> creator.complete();
>
> It would be consistent with the Tika approach on the read side.
>
> Cheers, Sergey


Re: HTML to PDF conversion

Ken Krugler
In reply to this post by Sergey Beryozkin
I can see the attraction of one API to convert XHTML to various formats.

Though very quickly that simple API would become complex, as each target format has its own conversion options.

And if successful, we’d pull in even more 3rd party jars to handle that conversion.

Wonder if there’s a need for a new project called “Akit”, which focuses on XHTML -> various formats :)

— Ken

> On Oct 16, 2019, at 5:05 AM, Sergey Beryozkin <[hidden email]> wrote:
>
> Ken, thanks for the feedback, I meant to reply to your comments,
>
> I suppose I really meant Tika offering a uniform API to create some simple
> structured PDF/etc files.
> ContentCreator creator = ContentCreator.get("PDF");
> creator.addTitle("Introduction to Tika");
> creator.addText("");
> creator.addTable("tablename", new LinkedHashMap<String, List<String>>());
> creator.addAttachment(someImage);
> creator.complete();
>
> It would be consistent with the Tika approach on the read side.
>
> Cheers, Sergey


Re: HTML to PDF conversion

Sergey Beryozkin
In reply to this post by Dave Fisher-2
Hi Dave

Thanks, I was suggesting a more neutral approach

Cheers, Sergey

On Wed, Oct 16, 2019 at 3:50 PM Dave Fisher <[hidden email]> wrote:

> Hi -
>
> You may want to take a look at Apache FOP which is part of the Apache XML
> Graphics project. My team had success with that in generating PDF from XML.
>
> Regards,
> Dave

Re: HTML to PDF conversion

Sergey Beryozkin
In reply to this post by Ken Krugler
That was not what I was suggesting. My only proposal was a simple API
(with no attempt to cover all the format-specific options at the API
level) that would let Tika users quickly create format-specific content
without having to deal with the format-specific libraries, consistent
with what Tika does on the read side.
I appreciate it would require some effort, and I'm by no means pushing for it.

Sergey

On Wed, Oct 16, 2019 at 4:50 PM Ken Krugler <[hidden email]> wrote:

> I can see the attraction of one API to convert XHTML to various formats.
>
> Though very quickly that simple API would become complex, as each target
> format has its own conversion options.
>
> And if successful, we’d pull in even more 3rd party jars to handle that
> conversion.
>
> Wonder if there’s a need for a new project called “Akit”, which focuses on
> XHTML -> various formats :)
>
> — Ken

Re: HTML to PDF conversion

Sergey Beryozkin
Such an API would of course be limited in that only fairly simple
format-specific content could be created, but many PDFs I've seen are very
simple, so I can imagine having, for example, a TikaPDFCreator
implementation of the ContentCreator interface which would just do some
simple delegation to PDFBox.

But anyway, plenty of tools exist for it...

Cheers, Sergey

On Wed, Oct 16, 2019 at 4:59 PM Sergey Beryozkin <[hidden email]>
wrote:

> It was not what I was suggesting. My only proposal was about having a
> simple API (without an attempt to cover all the various format specific
> options at the API level) which would let Tika users quickly create format
> specific content without having to deal with the format specific libraries,
> exactly consistent what it does on the read side.
> I appreciate it can require some effort and by no means I'm pushing for it
>
> Sergey

Re: HTML to PDF conversion

Tim Allison
In reply to this post by Ken Krugler
+1 to Ken’s earlier point about maintenance. Note Tika wouldn’t even build
in Germany, and we only discovered that because of inviting Tilman. :D We
have a huge amount of maintenance already...

Check out the incubating Daffodil project, which aims to convert files to
XML, validate them, and then serialize them back to the original format.

I do see a use for transform(), and if we could use XHTML as an
intermediary, then... maybe, but my inclination is with Ken.

On Wed, Oct 16, 2019 at 11:50 AM Ken Krugler <[hidden email]> wrote:

> I can see the attraction of one API to convert XHTML to various formats.
>
> Though very quickly that simple API would become complex, as each target
> format has its own conversion options.
>
> And if successful, we’d pull in even more 3rd party jars to handle that
> conversion.
>
> Wonder if there’s a need for a new project called “Akit”, which focuses on
> XHTML -> various formats :)
>
> — Ken

Re: HTML to PDF conversion

Sergey Beryozkin
Hi Tim, All
Sure, I agree that Tika is not really about transformation etc.; it is
just not what I was suggesting, even though I started with a link to
IHtmlToPdfTransformer. Let me clarify one more time and then I'll be happy
to move on. So, to put it in practical terms:
- create a tika-format-creator (or similarly named) module
- introduce a simple generic API (similar to the prototype API earlier in
the thread) for creating simple format-specific docs, and document that it
is going to stay experimental for a while
- this API is not about transformation; it is for Tika users to create the
docs directly
- provide only two implementations of this API for a start, one for PDF and
another for ODT. In time it may grow a bit to support a few more of the
most used formats; there is no goal to support hundreds of formats. (This
is why I don't understand the maintenance concern :-) )
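
To make the list above a bit more concrete, here is a rough sketch of a format-keyed registry that such a module might expose. All names (DocCreator, CreatorRegistry) are invented for illustration and nothing like them exists in Tika. The PDF and ODT creators are placeholders that only emit labelled text; real ones would delegate to format-specific libraries.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.function.Supplier;

// Hypothetical minimal creator contract: one call producing a finished document.
interface DocCreator {
    byte[] create(String title, String text);
}

// Hypothetical registry: a small, explicit set of supported formats, looked up by name.
class CreatorRegistry {
    private final Map<String, Supplier<DocCreator>> creators = new LinkedHashMap<>();

    void register(String format, Supplier<DocCreator> factory) {
        creators.put(format.toUpperCase(Locale.ROOT), factory);
    }

    // Returns a fresh creator for the format, or fails loudly for anything unregistered.
    DocCreator forFormat(String format) {
        Supplier<DocCreator> factory = creators.get(format.toUpperCase(Locale.ROOT));
        if (factory == null) {
            throw new IllegalArgumentException(
                    "Unsupported format: " + format + ", supported: " + creators.keySet());
        }
        return factory.get();
    }

    Set<String> supportedFormats() {
        return creators.keySet();
    }

    // The two starting formats from the list; both are placeholders in this sketch.
    static CreatorRegistry withDefaults() {
        CreatorRegistry registry = new CreatorRegistry();
        registry.register("PDF", () -> (title, text) -> placeholder("PDF", title, text));
        registry.register("ODT", () -> (title, text) -> placeholder("ODT", title, text));
        return registry;
    }

    // Emits labelled text instead of a real PDF/ODT byte stream.
    private static byte[] placeholder(String format, String title, String text) {
        return (format + " placeholder\n" + title + "\n" + text).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        CreatorRegistry registry = CreatorRegistry.withDefaults();
        byte[] doc = registry.forFormat("pdf").create("Introduction to Tika", "Hello");
        System.out.println(new String(doc, StandardCharsets.UTF_8));
        System.out.println("Supported: " + registry.supportedFormats());
    }
}
```

Keeping the supported set explicit in one place is one way to reflect the "a few most used formats, not hundreds" scope described above.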

In the end, users would be able to use a Tika-specific API to read docs
and, for some of the most used formats, to create them.
Tika's appeal is in having a uniform API for reading N formats, so users
don't have to write code that switches between N format-specific parser
APIs. But users working with Tika who also need to create documents in some
formats still have to go beyond Tika... ending up with semi-generic code
after all. That was the idea I tried to convey earlier in the thread...

Thanks all, Sergey


On Wed, Oct 16, 2019 at 5:07 PM Tim Allison <[hidden email]> wrote:

> +1 to Ken’s earlier point about maintenance. Note Tika wouldn’t even build
> in Germany, and we only discovered that because of inviting Tilman. :D We
> have a huge amount of maintenance already...
>
> Checkout the incubating Daffodil project that aims to convert files to xml,
> validate them and then serialize back to original format.
>
> I do see a use for transform() and if we could use xhtml as an
> intermediary, then...maybe, but My inclination is w Ken.