Summarier threads in nutch

Jack.Tang
Hi guys,

In the FetchedSegments class, the code below shows how the hit summaries are retrieved.

  public String[] getSummary(HitDetails[] details, Query query)
    throws IOException {
    SummaryThread[] threads = new SummaryThread[details.length];
    for (int i = 0; i < threads.length; i++) {
      threads[i] = new SummaryThread(details[i], query);
      threads[i].start();
    }
    ......
  }

This means that if there are 1,000,000 hits, 1,000,000 threads
would be spawned. But in fact we read the hits page by page, so why
not construct one thread pool whose size is the page size and retrieve
the summaries on demand? Does this sound better?
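
A minimal sketch of the pooled alternative proposed above, assuming Java 8 or later. `PooledSummaries`, `summarize`, and the plain strings used in place of `HitDetails` and `Query` are hypothetical stand-ins, not Nutch code; only the idea of one fixed pool sized to a page comes from this post.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class PooledSummaries {
  // One long-lived pool, sized to the number of hits shown per page.
  private final ExecutorService pool;

  public PooledSummaries(int pageSize) {
    this.pool = Executors.newFixedThreadPool(pageSize);
  }

  // Hypothetical stand-in for the per-hit work that SummaryThread performs.
  static String summarize(String detail, String query) {
    return detail + " [" + query + "]";
  }

  // Submits one task per hit and collects the results in hit order.
  public String[] getSummary(String[] details, String query) {
    List<Future<String>> futures = new ArrayList<>();
    for (String detail : details) {
      Callable<String> task = () -> summarize(detail, query);
      futures.add(pool.submit(task));
    }
    String[] summaries = new String[details.length];
    try {
      for (int i = 0; i < summaries.length; i++) {
        summaries[i] = futures.get(i).get(); // blocks until task i is done
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while summarizing", e);
    } catch (ExecutionException e) {
      throw new IllegalStateException("summary task failed", e);
    }
    return summaries;
  }

  // Call once when the pool is no longer needed.
  public void shutdown() {
    pool.shutdown();
  }
}
```

With a pool like this, the number of live worker threads stays at the page size no matter how long the details array is; extra tasks simply wait in the pool's queue.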

/Jack

--
Keep Discovering ... ...
http://www.jroller.com/page/jmars

Re: Summarier threads in nutch

Jack.Tang
Hi

Can someone explain the original design?
I also suggest refactoring the API (the FetchedSegments class) to:

public String[] getSummary(HitDetails[] details, int hitStart, int hitEnd, Query query) {
 ....
}

Does this make sense?

/Jack


Re: Summarier threads in nutch

Doug Cutting
In reply to this post by Jack.Tang
Jack Tang wrote:

> It means if the hits are 1,000,000 items, then 1,000,000 threads
> should be spawned.

A user interface typically asks for only 10 to 20 summaries at a time.
I do not believe that a thread pool would be substantially faster.
Thread spawning is pretty cheap in most JVMs.
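
For a page of 10 to 20 hits, the thread-per-summary approach amounts to starting a handful of short-lived threads and waiting for them. A rough sketch of that pattern follows; the `SummaryThread` here is a hypothetical stand-in that works on plain strings, and the join loop is an assumption, since the elided part of the real method is not shown in this thread.

```java
public class ThreadPerSummary {
  // Hypothetical stand-in for Nutch's SummaryThread: computes one summary.
  static class SummaryThread extends Thread {
    private final String detail;
    private final String query;
    private String summary;

    SummaryThread(String detail, String query) {
      this.detail = detail;
      this.query = query;
    }

    @Override
    public void run() {
      summary = detail + " [" + query + "]"; // placeholder for the real work
    }

    String getSummary() {
      return summary;
    }
  }

  // Starts one thread per hit, then waits for each and collects its result.
  public static String[] getSummary(String[] details, String query) {
    SummaryThread[] threads = new SummaryThread[details.length];
    for (int i = 0; i < threads.length; i++) {
      threads[i] = new SummaryThread(details[i], query);
      threads[i].start();
    }
    String[] results = new String[details.length];
    for (int i = 0; i < threads.length; i++) {
      try {
        threads[i].join(); // after join(), the thread's write is visible
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        throw new IllegalStateException("interrupted while waiting", e);
      }
      results[i] = threads[i].getSummary();
    }
    return results;
  }
}
```

The thread count equals details.length, so the cost stays small as long as the caller passes only one page of hits.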

Doug

Re: Summarier threads in nutch

Jack.Tang
On 2/23/06, Doug Cutting <[hidden email]> wrote:

> Jack Tang wrote:
> > It means if the hits are 1,000,000 items, then 1,000,000 threads
> > should be spawned.
>
> A user interface typically only asks for 10-to-20 summaries at a time.

Hi Doug,
Did I miss something?

SummaryThread[] threads = new SummaryThread[details.length];
Here, is details.length the number of hits on one page?
I thought it was the total number of hits, right?

/Jack


Re: Summarier threads in nutch

Stefan Groschupf-2
Hi Jack,
Summaries are created only for the hits displayed on one page.

Stefan


---------------------------------------------------------------
company:        http://www.media-style.com
forum:        http://www.text-mining.org
blog:            http://www.find23.net



Re: Summarier threads in nutch

Jack.Tang
Hi Stefan

Can you explain a little more? I cannot find any evidence of this in
the source code...
Thanks

/Jack


Re: Summarier threads in nutch

Stefan Groschupf-2
If 10 hits are displayed on one page, Nutch generates only 10
summaries.
If 20 hits are displayed on one result page, Nutch generates
20 summaries.
Nutch generates as many summaries as there are hits on your result page.
Does this answer the question?


---------------------------------------------
blog: http://www.find23.org
company: http://www.media-style.com



Re: Summarier threads in nutch

Jack.Tang
I don't think so.

Let's take the non-DFS case as an example.
NutchBean.getSummary(HitDetails, Query) invokes
FetchedSegments.getSummary(HitDetails, Query), which calls
Summarizer.getSummary(text, query, sumContext, sumLength).

As far as I can see, nothing there limits the size of HitDetails at all, right?

/Jack


Re: Summarier threads in nutch

Stefan Groschupf-2
Isn't HitDetails.length == hitsPerPage?
This happens in search.jsp.
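
A minimal sketch of that bound, using plain strings in place of Nutch's hit types. `PageSlice` and `pageOf` are hypothetical names, and this is not the actual code in search.jsp; it only illustrates why an array handed to getSummary would never be longer than hitsPerPage when the caller slices out one page first. It assumes hitsPerPage is positive.

```java
import java.util.Arrays;

public class PageSlice {
  // Returns the hits for one result page; never longer than hitsPerPage.
  public static String[] pageOf(String[] allHits, int start, int hitsPerPage) {
    // Clamp the start index into [0, allHits.length].
    int from = Math.max(0, Math.min(start, allHits.length));
    // The page ends at the last hit or after hitsPerPage items.
    int to = Math.min(allHits.length, from + hitsPerPage);
    return Arrays.copyOfRange(allHits, from, to);
  }
}
```

Only the returned slice would then be passed on for summarizing, so the thread count in getSummary tracks the page size rather than the total number of hits.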



Re: Summarier threads in nutch

Jack.Tang
Yes, you're right :) I found the answer.
Thanks.
