http chunked content

http chunked content

Stefan Groschupf-2
Hi,

It looks like the HTTP protocol plugin does not handle chunked content. :(
The method readChunkedContent is never used, and readPlainContent does
not handle chunked content.
As far as I know, a lot of HTTP servers respond with chunked content,
at least all those that return dynamically generated pages.
Should I file a bug?
Any thoughts?
Stefan

Re: http chunked content

Jérôme Charron
> As far as I know, a lot of HTTP servers respond with chunked content,
> at least all those that return dynamically generated pages.
> Should I file a bug?
> Any thoughts?

In fact, the requests issued by the http plugin use HTTP/1.0, so
servers should never return chunked content.
I think readChunkedContent was included in the code for future use.

Regards

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/

Re: http chunked content

Stefan Groschupf-2
I'm almost sure that this is not related to http 1.0 requests.

On 08.05.2006 at 03:20, Jérôme Charron wrote:

> In fact, the requests issued by the http plugin use HTTP/1.0, so
> servers should never return chunked content.
> I think readChunkedContent was included in the code for future use.


Re: http chunked content

Stefan Groschupf-2
http://www.apple.com, for example, answers with chunked content even if
you send the request with an HTTP/1.0 header.

On 08.05.2006 at 03:20, Jérôme Charron wrote:

> In fact, the requests issued by the http plugin use HTTP/1.0, so
> servers should never return chunked content.
> I think readChunkedContent was included in the code for future use.


Re: http chunked content

Jérôme Charron
> http://www.apple.com, for example, answers with chunked content even if
> you send the request with an HTTP/1.0 header.


Stefan,

I don't see any "Transfer-Encoding: chunked" header in responses from
www.apple.com.
Furthermore, the HTTP/1.1 specification states that "A server MUST NOT
send transfer-codings to an HTTP/1.0 client".
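
To illustrate the kind of header check described above, here is a minimal Java sketch that scans a raw response head for a "Transfer-Encoding: chunked" declaration. It is not code from Nutch; the class name TransferEncodingCheck and the method declaresChunked are hypothetical names chosen for this example, and it assumes the response head is already available as a string.

```java
import java.util.Locale;

// Hypothetical helper for illustration only; not part of Nutch.
final class TransferEncodingCheck {

    /** Returns true if the header block declares the "chunked" transfer coding. */
    static boolean declaresChunked(String rawHead) {
        // Header lines are separated by CRLF; tolerate a bare LF as well.
        String[] lines = rawHead.split("\r?\n");
        // Line 0 is the status line (e.g. "HTTP/1.1 200 OK"), so start at 1.
        for (int i = 1; i < lines.length; i++) {
            String line = lines[i];
            if (line.isEmpty()) {
                break; // a blank line ends the header block
            }
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // not a "name: value" line, ignore it
            }
            String name = line.substring(0, colon).trim();
            if (!name.equalsIgnoreCase("Transfer-Encoding")) {
                continue;
            }
            // The value may list several codings separated by commas.
            String value = line.substring(colon + 1).toLowerCase(Locale.ROOT);
            for (String coding : value.split(",")) {
                if (coding.trim().equals("chunked")) {
                    return true;
                }
            }
        }
        return false;
    }
}
```

A check like this only reports what the server declares; it says nothing about a server that sends chunked data without announcing it.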

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/

Re: http chunked content

Chris Fellows-3
> Furthermore, we can read in HTTP/1.1 specification that
> "A server MUST NOT send transfer-codings to an HTTP/1.0 client".

I once did a socket implementation against
Anonymizer. This is a well-established proxy service
that services $100K+ government and private contracts.

Their server always sent chunked content regardless of
the request headers. I'm pretty sure that there are
other well-established servers that send chunked
content despite the RFC.

I'm guessing that it might have something to do with
wanting to control content compression. All the
browsers can handle it, and that's probably all Apple
is concerned with, even though they're violating an
RFC requirement.

Chris



Re: http chunked content

Chris Fellows-3
I just remembered: I got around it by using HTTPClient,
which handles reading the response (chunked or not)
transparently. I haven't looked at the Nutch code, but
if we were to use HTTPClient 3.0.x or later, that
should take care of it.



Re: http chunked content

Andrzej Białecki-2
Chris Fellows wrote:
> Just remembered, got around it by using HTTPClient
> which handles reading the response (chunked or not)
> transparently. Haven't looked at the nutch code, but
> if we were to use HTTPClient 3.0.x or later, should
> take care of it.

Take a look at protocol-httpclient. This discussion is on whether/how to
fix protocol-http. The other plugin already supports this.

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: http chunked content

Chris Fellows-3
Okay, I saw the code in the protocol-http plugin. I
remember looking at this about a year ago. RFC 2616
(HTTP/1.1) does say, as Jérôme pointed out:

"A server MUST NOT send transfer-codings to an
HTTP/1.0 client."

Regardless, I can attest that there are servers out
there that return chunked content whatever the client
sends.

We had a socket implementation akin to
HttpResponse.java in the protocol-http plugin and were
stumped on how to tell whether a response was chunked
or not, as we could not rely on the Transfer-Encoding
header. The only way we could see was to look at the
initial hex characters denoting the size of the first
chunk.
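
As a rough illustration of that kind of sniffing, here is a minimal Java sketch that guesses whether a body starts with a chunk-size line. It is an assumption about what such a check could look like, not the code we actually had and not code from Nutch; ChunkSniffer and looksChunked are hypothetical names, and chunk extensions are deliberately ignored.

```java
// Hypothetical heuristic for illustration only; not part of Nutch.
final class ChunkSniffer {

    /** Guesses whether the body starts with "<hex digits>CRLF", i.e. a chunk-size line. */
    static boolean looksChunked(byte[] body) {
        int i = 0;
        while (i < body.length && isHex(body[i])) {
            i++;
        }
        // Require at least one hex digit, and no more than 8 (fits in an int).
        if (i == 0 || i > 8) {
            return false;
        }
        // The size line must be terminated by CRLF.
        return i + 1 < body.length && body[i] == '\r' && body[i + 1] == '\n';
    }

    private static boolean isHex(byte b) {
        return (b >= '0' && b <= '9')
            || (b >= 'a' && b <= 'f')
            || (b >= 'A' && b <= 'F');
    }
}
```

The weakness is easy to see: an unchunked plain-text body whose first line happens to be a word such as "cafe" is made of hex digits followed by CRLF, so the sketch reports it as chunked.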

"The chunk-size field is a string of hex digits
indicating the size of the chunk. The chunked encoding
is ended by any chunk whose size is zero, followed by
the trailer, which is terminated by an empty line."
(also from RFC 2616)
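
For readers who want to see the format in the RFC text above spelled out, here is a minimal Java sketch of a decoder for a body that is already known to be chunked. It is not the readChunkedContent method from Nutch; ChunkedDecoder and its methods are hypothetical names for this example, and a real implementation would also want to cap the chunk size it is willing to allocate.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical decoder for illustration only; not part of Nutch.
final class ChunkedDecoder {

    /** Convenience wrapper for in-memory data; rethrows I/O errors unchecked. */
    static byte[] decode(byte[] wire) {
        try {
            return decode(new ByteArrayInputStream(wire));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads chunks until the zero-size chunk, then consumes the trailer. */
    static byte[] decode(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while (true) {
            String sizeLine = readLine(in);
            // Drop an optional chunk extension (";name=value").
            int semi = sizeLine.indexOf(';');
            String hex = (semi >= 0 ? sizeLine.substring(0, semi) : sizeLine).trim();
            int size;
            try {
                size = Integer.parseInt(hex, 16);
            } catch (NumberFormatException e) {
                throw new IOException("Bad chunk size line: " + sizeLine, e);
            }
            if (size < 0) {
                throw new IOException("Negative chunk size: " + sizeLine);
            }
            if (size == 0) {
                break; // the last chunk has size zero
            }
            byte[] chunk = new byte[size];
            int read = 0;
            while (read < size) {
                int n = in.read(chunk, read, size - read);
                if (n < 0) {
                    throw new EOFException("Stream ended inside a chunk");
                }
                read += n;
            }
            out.write(chunk, 0, size);
            // Chunk data is followed by CRLF.
            if (!readLine(in).isEmpty()) {
                throw new IOException("Missing CRLF after chunk data");
            }
        }
        // The trailer is a list of header lines terminated by an empty line.
        while (!readLine(in).isEmpty()) {
            // trailer headers are ignored in this sketch
        }
        return out.toByteArray();
    }

    /** Reads one line terminated by LF and strips a trailing CR. */
    private static String readLine(InputStream in) throws IOException {
        ByteArrayOutputStream line = new ByteArrayOutputStream();
        int b;
        while ((b = in.read()) != -1) {
            if (b == '\n') {
                String s = new String(line.toByteArray(), StandardCharsets.ISO_8859_1);
                return s.endsWith("\r") ? s.substring(0, s.length() - 1) : s;
            }
            line.write(b);
        }
        throw new EOFException("Stream ended before end of line");
    }
}
```

Decoding is the easy part once the framing is known; the difficulty discussed in this thread is deciding whether to apply it at all when the headers cannot be trusted.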

But in practice, sniffing the first chunk size this way
was error-prone. Switching over to Apache HttpClient
eliminated the problem, as it transparently handles
chunked and unchunked content. But HttpClient is much
more heavyweight, so the conversion could only be done
after implementing some basic resource pooling on the
primary HttpClient object.

It does look like this would be a serious refactoring
job, as Nutch uses java.net classes throughout. On the
other hand, it might simplify some areas of the Nutch
protocol classes, and HttpClient does have some
interesting built-in support for multi-threading and
performance tuning of requests.

I hope this helps towards a solution.

Best Regards,

Chris
