Nutch getting rid of older segments

Nutch getting rid of older segments

abhay
Hello,

I have a large number of segments occupying disk space. Is it a good
strategy to delete old segments, or is it better to merge them?

Thank you
Abhay

Re: Nutch getting rid of older segments

Markus Jelsma-2
Hello Abhay,

You only need to keep or merge old segments if you 'quickly' need to
reindex the data, and are unable to start with a fresh crawl. If you
frequently recrawl all URLs, e.g. every month, then segments older than a month
can safely be removed.
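
As a rough illustration of the "older than a month" rule, here is a minimal Python sketch. It assumes segment directories still carry the timestamp names Nutch gives them by default (yyyyMMddHHmmss); the function name and the 30-day default are hypothetical. It only lists candidates and deletes nothing.

```python
import re
from datetime import datetime, timedelta

# Assumption: segment directories are named yyyyMMddHHmmss.
SEGMENT_NAME = re.compile(r"\d{14}")


def segments_older_than(names, now, max_age_days=30):
    """Return segment names whose timestamp is more than max_age_days before now."""
    cutoff = now - timedelta(days=max_age_days)
    old = []
    for name in sorted(names):
        if not SEGMENT_NAME.fullmatch(name):
            continue  # not a timestamp-named segment; leave it alone
        try:
            created = datetime.strptime(name, "%Y%m%d%H%M%S")
        except ValueError:
            continue  # 14 digits, but not a valid date
        if created < cutoff:
            old.append(name)
    return old
```

The names could come from os.listdir() on a local segments directory or from a listing of the HDFS path; review the returned list before removing anything.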

You can also do daily and monthly merges, like we do. This makes it possible
to revisit old data for research, in case websites change layout, or are no
longer customers and are no longer being crawled.
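
A monthly merge along these lines could be prepared with a small sketch like the one below. It groups timestamp-named segments by month and builds one `bin/nutch mergesegs` argument list per month, using the `mergesegs <output_dir> <seg1> <seg2> ...` form. The function name and the `crawl/segments` and `crawl/merged` paths are hypothetical, and the exact options should be checked against the usage message of your Nutch version. Nothing is executed here.

```python
from collections import defaultdict


def monthly_merge_commands(names, segments_dir="crawl/segments", out_root="crawl/merged"):
    """Map each month (yyyyMM) to a mergesegs argument list for that month's segments."""
    by_month = defaultdict(list)
    for name in sorted(names):
        # Assumption: segment names are 14-digit timestamps (yyyyMMddHHmmss).
        if len(name) == 14 and name.isascii() and name.isdigit():
            by_month[name[:6]].append(name)
    commands = {}
    for month, segs in by_month.items():
        commands[month] = ["bin/nutch", "mergesegs", f"{out_root}/{month}"] + [
            f"{segments_dir}/{s}" for s in segs
        ]
    return commands
```

Each argument list can be passed to subprocess.run() once it has been reviewed; the original segments should only be removed after the merged output has been verified.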

Regards,
Markus

On Tue, 6 Apr 2021 at 21:54, Abhay Ratnaparkhi <
[hidden email]> wrote:

> Hello,
>
> I have a large number of segments occupying disk space. Is it a good
> strategy to delete old segments, or is it better to merge them?
>
> Thank you
> Abhay
>

Re: Nutch getting rid of older segments

abhay
We frequently recrawl URLs (adaptive fetch interval of 3 to 30 days), so there
seems to be no harm in deleting segments older than a month.

Thank you.

On Wed, Apr 7, 2021 at 5:24 AM Markus Jelsma <[hidden email]>
wrote:

> Hello Abhay,
>
> You only need to keep or merge old segments if you 'quickly' need to
> reindex the data, and are unable to start with a fresh crawl. If you
> frequently recrawl all URLs, e.g. every month, then segments older than a month
> can safely be removed.
>
> You can also do daily and monthly merges, like we do. This makes it possible
> to revisit old data for research, in case websites change layout, or are no
> longer customers and are no longer being crawled.
>
> Regards,
> Markus