New tools: CrawlDbMerger, LinkDbMerger, SegmentMerger

Andrzej Białecki-2
Hi all,

I just committed a couple of new tools, and I'd like to briefly explain
their purpose and intended use.

* CrawlDbMerger: available from bin/nutch as 'mergedb'. You can merge
several existing DBs into one. This comes in handy if you ran several
partial crawls and you'd like to combine the DBs. Optionally, you can
run the current URLFilters on URLs in the databases, to filter out unwanted
URLs. This also works if you run it with just one input DB, which means
that you can use this tool for weeding out unwanted URLs from a single DB.

* LinkDbMerger: available from bin/nutch as 'mergelinkdb', with a
similar purpose to the above, and with similar options. Please note that
URLFilters, if activated, will apply to both target and source URLs.
This tool can be useful if you built partial linkdbs from groups of
segments, and then you need to integrate them into one (e.g. for
indexing or for searching). Or you can use it with a single linkdb, just
to filter out unwanted URLs.

* SegmentMerger: available as 'mergesegs'. This tool merges several
input segments into one or more output segments, with optional filtering
as above. Optionally, the output data can be divided into several
smaller segments of fixed size. There are many dos and don'ts regarding
the use of this tool, described in the Javadoc - please be sure to read
them before using it. This tool can be used e.g. to re-shape your segments
(in preparation for deployment to search servers), to filter out
unwanted data, or to minimize the number of active segments.
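
To make the merge-plus-filter idea concrete, here is a minimal Python
sketch. It is not the Nutch implementation (the real tools are Java jobs).
The names merge_dbs and no_query_strings are hypothetical, each DB is
modelled as a plain dict from URL to a (fetch_time, status) tuple, and the
rule that the entry with the newest fetch time wins on a conflict is an
assumption made for illustration; the Javadoc is the place to check the
actual semantics.

```python
from typing import Callable, Dict, Iterable, Optional, Tuple

# Hypothetical model: URL -> (fetch_time, status). Not the Nutch CrawlDatum.
Entry = Tuple[int, str]
Db = Dict[str, Entry]


def merge_dbs(dbs: Iterable[Db],
              url_filter: Optional[Callable[[str], bool]] = None) -> Db:
    """Merge several DBs into one, optionally dropping filtered URLs.

    Assumption: when a URL occurs in more than one DB, the entry with
    the newest fetch_time is kept.
    """
    merged: Db = {}
    for db in dbs:
        for url, entry in db.items():
            # Skip URLs rejected by the (optional) filter.
            if url_filter is not None and not url_filter(url):
                continue
            current = merged.get(url)
            if current is None or entry[0] > current[0]:
                merged[url] = entry
    return merged


def no_query_strings(url: str) -> bool:
    """Example filter: reject URLs that carry a query string."""
    return "?" not in url


db_a = {"http://example.com/": (100, "fetched"),
        "http://example.com/a?x=1": (100, "fetched")}
db_b = {"http://example.com/": (200, "fetched"),
        "http://example.com/b": (150, "unfetched")}

# Merge two partial DBs and filter at the same time.
merged = merge_dbs([db_a, db_b], url_filter=no_query_strings)

# A single input DB plus a filter just weeds out unwanted URLs.
cleaned = merge_dbs([db_a], url_filter=no_query_strings)
```

The second call mirrors the single-DB use described above: with only one
input, the merge step is a no-op and only the filtering has any effect.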

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: New tools: CrawlDbMerger, LinkDbMerger, SegmentMerger

Lukáš Vlček
Andrzej,
Thanks for your effort!

Are you going to post the tool descriptions somewhere on the wiki or in
the tutorial? It would be great if this information were available to
people outside the dev mailing list as well.

Regards,
Lukas

On 5/9/06, Andrzej Bialecki <[hidden email]> wrote:

> Hi all,
>
> I just committed a couple of new tools, and I'd like to briefly explain
> their purpose and intended use.
> [...]

Re: New tools: CrawlDbMerger, LinkDbMerger, SegmentMerger

Andrzej Białecki-2
Lukas Vlcek wrote:
> Andrzej,
> Thanks for your effort!
>
> Are you going to post the tool descriptions somewhere on the wiki or in
> the tutorial? It would be great if this information were available to
> people outside the dev mailing list as well.

If you have some spare cycles, would you be willing to do this? Take
excerpts from my email and from the Javadoc - I tried to make the Javadoc
in particular as complete as possible...

--
Best regards,
Andrzej Bialecki     <><

Re: New tools: CrawlDbMerger, LinkDbMerger, SegmentMerger

Lukáš Vlček
Andrzej,

My pleasure. I would choose the following location:
http://wiki.apache.org/nutch/DevelopmentCommandLineOptions
Let me know if you can think of anything better; otherwise I'll go ahead with it.

Regards,
Lukas

On 5/9/06, Andrzej Bialecki <[hidden email]> wrote:

> If you have some spare cycles, would you be willing to do this? Take
> excerpts from my email and from the Javadoc - I tried to make the Javadoc
> in particular as complete as possible...

Re: New tools: CrawlDbMerger, LinkDbMerger, SegmentMerger

Andrzej Białecki-2
Lukas Vlcek wrote:
> Andrzej,
>
> My pleasure. I would choose the following location:
> http://wiki.apache.org/nutch/DevelopmentCommandLineOptions
> Let me know if you can think of anything better; otherwise I'll go ahead with it.

Thanks, that would be a perfect place.

--
Best regards,
Andrzej Bialecki     <><

Re: New tools: CrawlDbMerger, LinkDbMerger, SegmentMerger

Lukáš Vlček
The new wiki pages have been created:
http://wiki.apache.org/nutch/nutch-0.8-dev/bin/nutch_mergedb
http://wiki.apache.org/nutch/nutch-0.8-dev/bin/nutch_mergelinkdb
http://wiki.apache.org/nutch/nutch-0.8-dev/bin/nutch_mergesegs

I hope I didn't introduce too many typos or other issues into the text.
I also tried to stick to the original style.

Regards,
Lukas

On 5/10/06, Andrzej Bialecki <[hidden email]> wrote:

> Thanks, that would be a perfect place.

Re: New tools: CrawlDbMerger, LinkDbMerger, SegmentMerger

Andrzej Białecki-2
Lukas Vlcek wrote:
> The new wiki pages have been created:
> http://wiki.apache.org/nutch/nutch-0.8-dev/bin/nutch_mergedb
> http://wiki.apache.org/nutch/nutch-0.8-dev/bin/nutch_mergelinkdb
> http://wiki.apache.org/nutch/nutch-0.8-dev/bin/nutch_mergesegs
>
> I hope I didn't introduce too many typos or other issues into the text.
> I also tried to stick to the original style.
>

Thanks a lot, looks great.

--
Best regards,
Andrzej Bialecki     <><