[jira] Created: (NUTCH-560) protocol-httpclient reading more bytes than http.content.limit


[jira] Created: (NUTCH-560) protocol-httpclient reading more bytes than http.content.limit

JIRA jira@apache.org
protocol-httpclient reading more bytes than http.content.limit
--------------------------------------------------------------

                 Key: NUTCH-560
                 URL: https://issues.apache.org/jira/browse/NUTCH-560
             Project: Nutch
          Issue Type: Bug
          Components: fetcher
    Affects Versions: 0.9.0, 1.0.0
            Reporter: Joseph M.


I modified HttpResponse.java in protocol-httpclient to download files to the file system. If I set http.content.limit to 5000, it fetches around 5500 to 6000 bytes instead and writes them to the file system. There is a calculation mistake in the calculateTryToRead() function.

{code}
int tryAndRead = calculateTryToRead(totalRead);
while ((bufferFilled = in.read(buffer, 0, buffer.length)) != -1 && tryAndRead > 0) {
  totalRead += bufferFilled;
  out.write(buffer, 0, bufferFilled);
  tryAndRead = calculateTryToRead(totalRead);
}
{code}

The while loop stops only when calculateTryToRead() returns a negative value or 0.

{code}
private int calculateTryToRead(int totalRead) {
  int tryToRead = Http.BUFFER_SIZE;
  if (http.getMaxContent() <= 0) {
    return http.BUFFER_SIZE;
  } else if (http.getMaxContent() - totalRead < http.BUFFER_SIZE) {
    tryToRead = http.getMaxContent() - totalRead;
  }
  return tryToRead;
}
{code}

It returns a negative value only once totalRead is already greater than http.getMaxContent(), so more bytes than http.content.limit have been read before the while loop breaks.
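
For illustration only, here is a minimal sketch of a loop that never reads past the limit. It is not the fix that was committed; the class name BoundedCopy, the method copyWithLimit and the 8 KB BUFFER_SIZE are assumptions made up for this example. The idea is to pass the remaining byte budget to in.read() instead of always asking for a full buffer.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

class BoundedCopy {
  // Assumed stand-in for Http.BUFFER_SIZE; the real value is not given in this report.
  static final int BUFFER_SIZE = 8 * 1024;

  /**
   * Copies at most maxContent bytes from in to out.
   * As in calculateTryToRead(), maxContent <= 0 is treated as "no limit".
   * Returns the number of bytes copied.
   */
  static int copyWithLimit(InputStream in, OutputStream out, int maxContent)
      throws IOException {
    byte[] buffer = new byte[BUFFER_SIZE];
    int totalRead = 0;
    while (true) {
      int tryToRead = buffer.length;
      if (maxContent > 0) {
        // Never ask for more than the remaining budget.
        tryToRead = Math.min(tryToRead, maxContent - totalRead);
        if (tryToRead <= 0) {
          break; // limit reached before issuing another read
        }
      }
      int bufferFilled = in.read(buffer, 0, tryToRead);
      if (bufferFilled == -1) {
        break; // end of stream
      }
      totalRead += bufferFilled;
      out.write(buffer, 0, bufferFilled);
    }
    return totalRead;
  }

  public static void main(String[] args) throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    int copied = copyWithLimit(new ByteArrayInputStream(new byte[6000]), out, 5000);
    System.out.println(copied); // prints 5000
  }
}
```

Because the limit check happens before each read and the read length is capped, the total can never exceed maxContent in this sketch.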

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (NUTCH-560) protocol-httpclient reading more bytes than http.content.limit

JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/NUTCH-560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12530519 ]

Susam Pal commented on NUTCH-560:
---------------------------------

I analysed 'protocol-http' and it behaves in almost the same manner. While buffering, we cannot stop reading after exactly 'http.content.limit' bytes have been read. The limit check only reports that the limit has been exceeded one iteration after it was crossed. So this doesn't seem like a bug. However, the current code doesn't take care of reading up to 'Content-Length' bytes, which NUTCH-559 does.
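
To make the "one iteration after the limit" behaviour concrete, the following is a rough simulation of the loop quoted in the report. Everything outside the loop body is an assumption for the sake of the example: the class OvershootDemo, the ChunkedStream helper that hands back at most 1500 bytes per read (imitating a network stream), and the 8 KB BUFFER_SIZE.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

class OvershootDemo {
  // Assumed stand-in for Http.BUFFER_SIZE.
  static final int BUFFER_SIZE = 8 * 1024;

  /** Hypothetical stream that returns at most 'chunk' bytes per read call. */
  static class ChunkedStream extends FilterInputStream {
    private final int chunk;

    ChunkedStream(InputStream in, int chunk) {
      super(in);
      this.chunk = chunk;
    }

    @Override
    public int read(byte[] b, int off, int len) throws IOException {
      return super.read(b, off, Math.min(len, chunk));
    }
  }

  // Same arithmetic as calculateTryToRead() in the report, with the limit passed in.
  static int calculateTryToRead(int totalRead, int maxContent) {
    int tryToRead = BUFFER_SIZE;
    if (maxContent <= 0) {
      return BUFFER_SIZE;
    } else if (maxContent - totalRead < BUFFER_SIZE) {
      tryToRead = maxContent - totalRead;
    }
    return tryToRead;
  }

  /** Runs the loop from the report and returns the number of bytes written. */
  static int readLikeReport(InputStream in, int maxContent) throws IOException {
    byte[] buffer = new byte[BUFFER_SIZE];
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    int totalRead = 0;
    int bufferFilled;
    int tryAndRead = calculateTryToRead(totalRead, maxContent);
    // Each read asks for a full buffer; the limit is only checked afterwards.
    while ((bufferFilled = in.read(buffer, 0, buffer.length)) != -1 && tryAndRead > 0) {
      totalRead += bufferFilled;
      out.write(buffer, 0, bufferFilled);
      tryAndRead = calculateTryToRead(totalRead, maxContent);
    }
    return out.size();
  }

  public static void main(String[] args) throws IOException {
    InputStream in = new ChunkedStream(new ByteArrayInputStream(new byte[10000]), 1500);
    System.out.println(readLikeReport(in, 5000)); // prints 6000
  }
}
```

With 1500-byte chunks the total goes 1500, 3000, 4500, 6000: the fourth read is still allowed because 500 bytes of budget remain, and the loop only notices the overshoot on the next check.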




[jira] Closed: (NUTCH-560) protocol-httpclient reading more bytes than http.content.limit

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/NUTCH-560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doğacan Güney closed NUTCH-560.
-------------------------------

       Resolution: Fixed
    Fix Version/s: 1.0.0
         Assignee: Doğacan Güney

Fixed as part of NUTCH-559.

