[jira] Created: (NUTCH-288) hitsPerSite-functionality "flawed": problems writing a page-navigation

classic Classic list List threaded Threaded
6 messages Options
Reply | Threaded
Open this post in threaded view
|

[jira] Created: (NUTCH-288) hitsPerSite-functionality "flawed": problems writing a page-navigation

Sebastian Nagel (Jira)
hitsPerSite-functionality "flawed": problems writing a page-navigation
----------------------------------------------------------------------

         Key: NUTCH-288
         URL: http://issues.apache.org/jira/browse/NUTCH-288
     Project: Nutch
        Type: Bug

  Components: web gui  
    Versions: 0.8-dev    
    Reporter: Stefan Neufeind


The deduplication-functionality on a per-site-basis (hitsPerSite = 3) leads to problems when trying to offer a page-navigation (e.g. allow the user to jump to page 10). This is because dedup is done after fetching.

RSS shows a maximum number of 7763 documents (that is without dedup!), I set it to display 10 items per page. My "naive" approach was to estimate I have 7763/10 = 777 pages. But already when moving to page 3 I got no more searchresults (I guess because of dedup). And when moving to page 10 I  got an exception (see below).

2006-05-25 16:24:43 StandardWrapperValve[OpenSearch]: Servlet.service() for servlet OpenSearch threw exception
java.lang.NegativeArraySizeException
        at org.apache.nutch.searcher.Hits.getHits(Hits.java:65)
        at org.apache.nutch.searcher.OpenSearchServlet.doGet(OpenSearchServlet.java:149)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:689)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:252)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:173)
        at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:214)
        at org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104)
        at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520)
        at org.apache.catalina.core.StandardContextValve.invokeInternal(StandardContextValve.java:198)
        at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:152)
        at org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104)
        at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520)
        at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:137)
        at org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104)
        at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:118)
        at org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:102)
        at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520)
        at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
        at org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104)
        at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520)
        at org.apache.catalina.core.ContainerBase.invoke(ContainerBase.java:929)
        at org.apache.coyote.tomcat5.CoyoteAdapter.service(CoyoteAdapter.java:160)
        at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:799)
        at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.processConnection(Http11Protocol.java:705)
        at org.apache.tomcat.util.net.TcpWorkerThread.runIt(PoolTcpEndpoint.java:577)
        at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:684)
        at java.lang.Thread.run(Thread.java:595)

Only workaround I see for the moment: Fetching RSS without duplication, dedup myself and cache the RSS-result to improve performance. But a cleaner solution would imho be nice. Is there a performant way of doing deduplication and knowing for sure how many documents are available to view? For sure this would mean to dedup all search-results first ...

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (NUTCH-288) hitsPerSite-functionality "flawed": problems writing a page-navigation

Sebastian Nagel (Jira)
    [ http://issues.apache.org/jira/browse/NUTCH-288?page=comments#action_12413272 ]

Doug Cutting commented on NUTCH-288:
------------------------------------

> Is there a performant way of doing deduplication and knowing for sure how many documents are available to view?

No.  But we should probably handle this situation without throwing exceptions.

For example, look at the following:

http://www.google.com/search?q=emacs+%22doug+cutting%22&start=90

Click on the page "19" link at the bottom.  It takes you to page 16, the last page after deduplication.


> hitsPerSite-functionality "flawed": problems writing a page-navigation
> ----------------------------------------------------------------------
>
>          Key: NUTCH-288
>          URL: http://issues.apache.org/jira/browse/NUTCH-288
>      Project: Nutch
>         Type: Bug

>   Components: web gui
>     Versions: 0.8-dev
>     Reporter: Stefan Neufeind

>
> The deduplication-functionality on a per-site-basis (hitsPerSite = 3) leads to problems when trying to offer a page-navigation (e.g. allow the user to jump to page 10). This is because dedup is done after fetching.
> RSS shows a maximum number of 7763 documents (that is without dedup!), I set it to display 10 items per page. My "naive" approach was to estimate I have 7763/10 = 777 pages. But already when moving to page 3 I got no more searchresults (I guess because of dedup). And when moving to page 10 I  got an exception (see below).
> 2006-05-25 16:24:43 StandardWrapperValve[OpenSearch]: Servlet.service() for servlet OpenSearch threw exception
> java.lang.NegativeArraySizeException
>         at org.apache.nutch.searcher.Hits.getHits(Hits.java:65)
>         at org.apache.nutch.searcher.OpenSearchServlet.doGet(OpenSearchServlet.java:149)
>         at javax.servlet.http.HttpServlet.service(HttpServlet.java:689)
>         at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
>         at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:252)
>         at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:173)
>         at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:214)
>         at org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104)
>         at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520)
>         at org.apache.catalina.core.StandardContextValve.invokeInternal(StandardContextValve.java:198)
>         at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:152)
>         at org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104)
>         at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520)
>         at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:137)
>         at org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104)
>         at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:118)
>         at org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:102)
>         at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520)
>         at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
>         at org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104)
>         at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520)
>         at org.apache.catalina.core.ContainerBase.invoke(ContainerBase.java:929)
>         at org.apache.coyote.tomcat5.CoyoteAdapter.service(CoyoteAdapter.java:160)
>         at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:799)
>         at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.processConnection(Http11Protocol.java:705)
>         at org.apache.tomcat.util.net.TcpWorkerThread.runIt(PoolTcpEndpoint.java:577)
>         at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:684)
>         at java.lang.Thread.run(Thread.java:595)
> Only workaround I see for the moment: Fetching RSS without duplication, dedup myself and cache the RSS-result to improve performance. But a cleaner solution would imho be nice. Is there a performant way of doing deduplication and knowing for sure how many documents are available to view? For sure this would mean to dedup all search-results first ...

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (NUTCH-288) hitsPerSite-functionality "flawed": problems writing a page-navigation

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)
    [ http://issues.apache.org/jira/browse/NUTCH-288?page=comments#action_12413275 ]

Stefan Neufeind commented on NUTCH-288:
---------------------------------------

How do they do that? Right, I'm transfered to page 16. But if I click on page 14 this also seems to be the last page in order? Something looks strange there, too ...

And using Nutch: How should I know (using the RSS-feed) on which page I am? I'm getting the above exception - no reply, and no new "start"-value so I could compute on which page I actually am. Is there a quickfix possible somehow?

> hitsPerSite-functionality "flawed": problems writing a page-navigation
> ----------------------------------------------------------------------
>
>          Key: NUTCH-288
>          URL: http://issues.apache.org/jira/browse/NUTCH-288
>      Project: Nutch
>         Type: Bug

>   Components: web gui
>     Versions: 0.8-dev
>     Reporter: Stefan Neufeind

>
> The deduplication-functionality on a per-site-basis (hitsPerSite = 3) leads to problems when trying to offer a page-navigation (e.g. allow the user to jump to page 10). This is because dedup is done after fetching.
> RSS shows a maximum number of 7763 documents (that is without dedup!), I set it to display 10 items per page. My "naive" approach was to estimate I have 7763/10 = 777 pages. But already when moving to page 3 I got no more searchresults (I guess because of dedup). And when moving to page 10 I  got an exception (see below).
> 2006-05-25 16:24:43 StandardWrapperValve[OpenSearch]: Servlet.service() for servlet OpenSearch threw exception
> java.lang.NegativeArraySizeException
>         at org.apache.nutch.searcher.Hits.getHits(Hits.java:65)
>         at org.apache.nutch.searcher.OpenSearchServlet.doGet(OpenSearchServlet.java:149)
>         at javax.servlet.http.HttpServlet.service(HttpServlet.java:689)
>         at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
>         at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:252)
>         at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:173)
>         at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:214)
>         at org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104)
>         at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520)
>         at org.apache.catalina.core.StandardContextValve.invokeInternal(StandardContextValve.java:198)
>         at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:152)
>         at org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104)
>         at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520)
>         at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:137)
>         at org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104)
>         at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:118)
>         at org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:102)
>         at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520)
>         at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
>         at org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104)
>         at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520)
>         at org.apache.catalina.core.ContainerBase.invoke(ContainerBase.java:929)
>         at org.apache.coyote.tomcat5.CoyoteAdapter.service(CoyoteAdapter.java:160)
>         at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:799)
>         at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.processConnection(Http11Protocol.java:705)
>         at org.apache.tomcat.util.net.TcpWorkerThread.runIt(PoolTcpEndpoint.java:577)
>         at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:684)
>         at java.lang.Thread.run(Thread.java:595)
> Only workaround I see for the moment: Fetching RSS without duplication, dedup myself and cache the RSS-result to improve performance. But a cleaner solution would imho be nice. Is there a performant way of doing deduplication and knowing for sure how many documents are available to view? For sure this would mean to dedup all search-results first ...

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (NUTCH-288) hitsPerSite-functionality "flawed": problems writing a page-navigation

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)
    [ http://issues.apache.org/jira/browse/NUTCH-288?page=comments#action_12413305 ]

Doug Cutting commented on NUTCH-288:
------------------------------------

>  Is there a quickfix possible somehow?

Someone needs to fix the OpenSearch servlet.

It looks like just changing line 146 of OpenSearchServlet.java, replacing:

    Hit[] show = hits.getHits(start, end-start);

with:

    Hit[] show = hits.getHits(start, length > 0 ? length : 0);

Give this a try.

> hitsPerSite-functionality "flawed": problems writing a page-navigation
> ----------------------------------------------------------------------
>
>          Key: NUTCH-288
>          URL: http://issues.apache.org/jira/browse/NUTCH-288
>      Project: Nutch
>         Type: Bug

>   Components: web gui
>     Versions: 0.8-dev
>     Reporter: Stefan Neufeind

>
> The deduplication-functionality on a per-site-basis (hitsPerSite = 3) leads to problems when trying to offer a page-navigation (e.g. allow the user to jump to page 10). This is because dedup is done after fetching.
> RSS shows a maximum number of 7763 documents (that is without dedup!), I set it to display 10 items per page. My "naive" approach was to estimate I have 7763/10 = 777 pages. But already when moving to page 3 I got no more searchresults (I guess because of dedup). And when moving to page 10 I  got an exception (see below).
> 2006-05-25 16:24:43 StandardWrapperValve[OpenSearch]: Servlet.service() for servlet OpenSearch threw exception
> java.lang.NegativeArraySizeException
>         at org.apache.nutch.searcher.Hits.getHits(Hits.java:65)
>         at org.apache.nutch.searcher.OpenSearchServlet.doGet(OpenSearchServlet.java:149)
>         at javax.servlet.http.HttpServlet.service(HttpServlet.java:689)
>         at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
>         at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:252)
>         at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:173)
>         at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:214)
>         at org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104)
>         at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520)
>         at org.apache.catalina.core.StandardContextValve.invokeInternal(StandardContextValve.java:198)
>         at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:152)
>         at org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104)
>         at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520)
>         at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:137)
>         at org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104)
>         at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:118)
>         at org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:102)
>         at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520)
>         at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
>         at org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104)
>         at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520)
>         at org.apache.catalina.core.ContainerBase.invoke(ContainerBase.java:929)
>         at org.apache.coyote.tomcat5.CoyoteAdapter.service(CoyoteAdapter.java:160)
>         at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:799)
>         at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.processConnection(Http11Protocol.java:705)
>         at org.apache.tomcat.util.net.TcpWorkerThread.runIt(PoolTcpEndpoint.java:577)
>         at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:684)
>         at java.lang.Thread.run(Thread.java:595)
> Only workaround I see for the moment: Fetching RSS without duplication, dedup myself and cache the RSS-result to improve performance. But a cleaner solution would imho be nice. Is there a performant way of doing deduplication and knowing for sure how many documents are available to view? For sure this would mean to dedup all search-results first ...

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira

Reply | Threaded
Open this post in threaded view
|

[jira] Updated: (NUTCH-288) hitsPerSite-functionality "flawed": problems writing a page-navigation

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)
     [ http://issues.apache.org/jira/browse/NUTCH-288?page=all ]

Stefan Neufeind updated NUTCH-288:
----------------------------------

    Attachment: NUTCH-288-OpenSearch-fix.patch

This patch includes Doug's one-line fix to prevent an exception.
Also it does go back page by page until you get to the last result-page. The start-value returned in the RSS-feed is correct afterwards(!). This easily allows you to check whether the requested result-start and the one received are identical - otherwise you are on the last page and were "redirected" - and now know that you don't need to display any pages in your page-navigation following this one :-)

Applies and works fine for me.

> hitsPerSite-functionality "flawed": problems writing a page-navigation
> ----------------------------------------------------------------------
>
>          Key: NUTCH-288
>          URL: http://issues.apache.org/jira/browse/NUTCH-288
>      Project: Nutch
>         Type: Bug

>   Components: web gui
>     Versions: 0.8-dev
>     Reporter: Stefan Neufeind
>  Attachments: NUTCH-288-OpenSearch-fix.patch
>
> The deduplication-functionality on a per-site-basis (hitsPerSite = 3) leads to problems when trying to offer a page-navigation (e.g. allow the user to jump to page 10). This is because dedup is done after fetching.
> RSS shows a maximum number of 7763 documents (that is without dedup!), I set it to display 10 items per page. My "naive" approach was to estimate I have 7763/10 = 777 pages. But already when moving to page 3 I got no more searchresults (I guess because of dedup). And when moving to page 10 I  got an exception (see below).
> 2006-05-25 16:24:43 StandardWrapperValve[OpenSearch]: Servlet.service() for servlet OpenSearch threw exception
> java.lang.NegativeArraySizeException
>         at org.apache.nutch.searcher.Hits.getHits(Hits.java:65)
>         at org.apache.nutch.searcher.OpenSearchServlet.doGet(OpenSearchServlet.java:149)
>         at javax.servlet.http.HttpServlet.service(HttpServlet.java:689)
>         at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
>         at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:252)
>         at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:173)
>         at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:214)
>         at org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104)
>         at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520)
>         at org.apache.catalina.core.StandardContextValve.invokeInternal(StandardContextValve.java:198)
>         at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:152)
>         at org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104)
>         at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520)
>         at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:137)
>         at org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104)
>         at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:118)
>         at org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:102)
>         at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520)
>         at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
>         at org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104)
>         at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520)
>         at org.apache.catalina.core.ContainerBase.invoke(ContainerBase.java:929)
>         at org.apache.coyote.tomcat5.CoyoteAdapter.service(CoyoteAdapter.java:160)
>         at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:799)
>         at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.processConnection(Http11Protocol.java:705)
>         at org.apache.tomcat.util.net.TcpWorkerThread.runIt(PoolTcpEndpoint.java:577)
>         at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:684)
>         at java.lang.Thread.run(Thread.java:595)
> Only workaround I see for the moment: Fetching RSS without duplication, dedup myself and cache the RSS-result to improve performance. But a cleaner solution would imho be nice. Is there a performant way of doing deduplication and knowing for sure how many documents are available to view? For sure this would mean to dedup all search-results first ...

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (NUTCH-288) hitsPerSite-functionality "flawed": problems writing a page-navigation

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)
    [ http://issues.apache.org/jira/browse/NUTCH-288?page=comments#action_12414441 ]

Stefan Groschupf commented on NUTCH-288:
----------------------------------------

HI Stefan,
>Also it does go back page by page until you get to the last result-page
Isn't it possible to caculate the latest page instead of using a while loop to find the latest page?


> hitsPerSite-functionality "flawed": problems writing a page-navigation
> ----------------------------------------------------------------------
>
>          Key: NUTCH-288
>          URL: http://issues.apache.org/jira/browse/NUTCH-288
>      Project: Nutch
>         Type: Bug

>   Components: web gui
>     Versions: 0.8-dev
>     Reporter: Stefan Neufeind
>  Attachments: NUTCH-288-OpenSearch-fix.patch
>
> The deduplication-functionality on a per-site-basis (hitsPerSite = 3) leads to problems when trying to offer a page-navigation (e.g. allow the user to jump to page 10). This is because dedup is done after fetching.
> RSS shows a maximum number of 7763 documents (that is without dedup!), I set it to display 10 items per page. My "naive" approach was to estimate I have 7763/10 = 777 pages. But already when moving to page 3 I got no more searchresults (I guess because of dedup). And when moving to page 10 I  got an exception (see below).
> 2006-05-25 16:24:43 StandardWrapperValve[OpenSearch]: Servlet.service() for servlet OpenSearch threw exception
> java.lang.NegativeArraySizeException
>         at org.apache.nutch.searcher.Hits.getHits(Hits.java:65)
>         at org.apache.nutch.searcher.OpenSearchServlet.doGet(OpenSearchServlet.java:149)
>         at javax.servlet.http.HttpServlet.service(HttpServlet.java:689)
>         at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
>         at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:252)
>         at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:173)
>         at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:214)
>         at org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104)
>         at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520)
>         at org.apache.catalina.core.StandardContextValve.invokeInternal(StandardContextValve.java:198)
>         at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:152)
>         at org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104)
>         at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520)
>         at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:137)
>         at org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104)
>         at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:118)
>         at org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:102)
>         at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520)
>         at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
>         at org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104)
>         at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520)
>         at org.apache.catalina.core.ContainerBase.invoke(ContainerBase.java:929)
>         at org.apache.coyote.tomcat5.CoyoteAdapter.service(CoyoteAdapter.java:160)
>         at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:799)
>         at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.processConnection(Http11Protocol.java:705)
>         at org.apache.tomcat.util.net.TcpWorkerThread.runIt(PoolTcpEndpoint.java:577)
>         at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:684)
>         at java.lang.Thread.run(Thread.java:595)
> Only workaround I see for the moment: Fetching RSS without duplication, dedup myself and cache the RSS-result to improve performance. But a cleaner solution would imho be nice. Is there a performant way of doing deduplication and knowing for sure how many documents are available to view? For sure this would mean to dedup all search-results first ...

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira