MapReduce and segment merging

MapReduce and segment merging

Mike Alulin
Is it possible to merge segments in the map reduce version of Nutch?

               
Re: MapReduce and segment merging

Andrzej Białecki-2
Mike Alulin wrote:
> Is it possible to merge segments in the map reduce version of Nutch?
>  

Not yet.

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: MapReduce and segment merging

Mike Alulin
Then how do people use the new version if they need, say, daily crawls of new/updated pages? I crawl updated pages every 24 hours, and if I do not merge the segments I will soon have hundreds of them. What is the best solution in this case?

A full recrawl is not a good option, as I have millions of documents and I DO know which of them were updated without requesting them.
 

Re: MapReduce and segment merging

Andrzej Białecki-2
Mike Alulin wrote:
> Then how do people use the new version if they need, say, daily crawls of new/updated pages? I crawl updated pages every 24 hours, and if I do not merge the segments I will soon have hundreds of them. What is the best solution in this case?
>
> A full recrawl is not a good option, as I have millions of documents and I DO know which of them were updated without requesting them.
>  

This is a development version, nobody said it's feature complete.
Patience, my friend... or spend some effort to improve it. ;-)

RE: MapReduce and segment merging

Goldschmidt, Dave
In reply to this post by Mike Alulin
Could you also just copy the segments out of NDFS to local disk,
perform the merges locally, and then copy the merged segments back
into NDFS?

DaveG
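
A minimal sketch of the copy-out, merge-locally, copy-back round trip Dave describes, in Python. The thread does not name the actual commands, so `copy_out`, `merge`, and `copy_in` are hypothetical hooks that the caller would wire to whatever NDFS copy and segment-merge tools are available; `merge_via_local` and `merged_name` are likewise illustrative names, not part of Nutch.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Callable, List


def merge_via_local(
    segment_names: List[str],
    copy_out: Callable[[str, Path], None],      # hypothetical: NDFS segment -> local path
    merge: Callable[[List[Path], Path], None],  # hypothetical: local merge tool
    copy_in: Callable[[Path, str], None],       # hypothetical: local path -> NDFS segment
    merged_name: str,
) -> str:
    """Copy segments to a local scratch dir, merge them there, copy the result back."""
    scratch = Path(tempfile.mkdtemp(prefix="segmerge-"))
    try:
        local_segments = []
        for name in segment_names:
            dest = scratch / "in" / name
            dest.parent.mkdir(parents=True, exist_ok=True)
            copy_out(name, dest)
            local_segments.append(dest)

        merged = scratch / "out" / merged_name
        merged.parent.mkdir(parents=True, exist_ok=True)
        merge(local_segments, merged)

        # Only publish the merged segment after the merge step has succeeded.
        copy_in(merged, merged_name)
        return merged_name
    finally:
        # Always clean up the local scratch copy, even if a step failed.
        shutil.rmtree(scratch, ignore_errors=True)
```

Note that the local machine needs enough free disk for every input segment plus the merged output, and the original segments are left in place so that removing them stays a separate, deliberate step.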


-----Original Message-----
From: Andrzej Bialecki [mailto:[hidden email]]
Sent: Thursday, January 12, 2006 2:14 PM
To: [hidden email]
Subject: Re: MapReduce and segment merging

Mike Alulin wrote:
> Then how do people use the new version if they need, say, daily
> crawls of new/updated pages? I crawl updated pages every 24 hours,
> and if I do not merge the segments I will soon have hundreds of them.
> What is the best solution in this case?
>
> A full recrawl is not a good option, as I have millions of documents
> and I DO know which of them were updated without requesting them.
>

This is a development version, nobody said it's feature complete.
Patience, my friend... or spend some effort to improve it. ;-)

RE: MapReduce and segment merging

Byron Miller-2
I was thinking that Nutch needs some sort of workflow
manager. That way you could build jobs from specific
workflows and, hopefully, recover jobs based on the
portion of the workflow where they are stuck (or
restart a job if it failed or its processing time
exceeds x hours, and apply other such workflow rules).

Something like that could also send notifications when
jobs are done, trigger other events, and provide a
management interface showing what your cluster is up
to, or let configuration types be defined per batch
job/workflow process that is "in process". For example,
if I'm building a blog index I may want more, smaller
segments based upon daily fetches, while for other jobs
I may want fewer, larger segments.

Does something like that make much sense for where
the mapred branch is going?

Is workflow the right term for such a beast?

-byron
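
The "recover a job from the step where it got stuck" part of this idea can be sketched in a few lines of Python. This is only an illustration under assumptions: `run_workflow`, the step names, and the JSON checkpoint file are hypothetical and not part of Nutch or the mapred branch, and the sketch covers resuming after a failure only, not time limits or notifications.

```python
import json
from pathlib import Path
from typing import Callable, List, Tuple

# A step is a (name, action) pair; names must be unique within a workflow.
Step = Tuple[str, Callable[[], None]]


def run_workflow(steps: List[Step], checkpoint: Path) -> List[str]:
    """Run steps in order, skipping any already recorded in the checkpoint file.

    Returns the names of the steps executed in this invocation.
    """
    done = json.loads(checkpoint.read_text()) if checkpoint.exists() else []
    ran = []
    for name, action in steps:
        if name in done:
            continue  # completed in an earlier run
        action()  # if this raises, the checkpoint still lists only finished steps
        done.append(name)
        checkpoint.write_text(json.dumps(done))
        ran.append(name)
    return ran
```

Rerunning the same workflow with the same checkpoint file after a failure picks up at the first step that did not finish, which is the restart behaviour described above.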


