Faster Merging?

Jon Shoberg

Are there any optimizations that can be done when merging segments?

I'm using -numFetchers when calling generate and then merging the segments
back when done.  Note the slow merge rate (roughly 60-75 rec/s):

050926 101850 * Merging all segments into segments
050926 102420  Processed 20000 records (60.708897 rec/s)
050926 102844  Processed 40000 records (75.714554 rec/s)
050926 103337  Processed 60000 records (68.25682 rec/s)

--

050926 101521 parsing file:/data/nutch07/conf/nutch-site.xml
050926 101522 No FS indicated, using default:local
050926 101522 * Opening 200 segments:
050926 101522  - segment 20050926073716-0: 1671 records.
050926 101522  - segment 20050926073716-1: 922 records.
050926 101522  - segment 20050926073716-10: 91 records.
050926 101522  - segment 20050926073716-11: 4928 records.
050926 101522  - segment 20050926073716-12: 946 records.
050926 101522  - segment 20050926073716-13: 3306 records.
050926 101522  - segment 20050926073716-14: 1002 records.
050926 101522  - segment 20050926073716-15: 4794 records.
050926 101522  - segment 20050926073716-16: 1542 records.
050926 101523  - segment 20050926073716-17: 218 records.
050926 101523  - segment 20050926073716-18: 1438 records.
050926 101523  - segment 20050926073716-19: 1025 records.
050926 101523  - segment 20050926073716-2: 991 records.
050926 101523  - segment 20050926073716-20: 5468 records.
050926 101523  - segment 20050926073716-21: 2992 records.
050926 101523  - segment 20050926073716-22: 1934 records.
050926 101523  - segment 20050926073716-23: 1403 records.
050926 101523  - segment 20050926073716-24: 862 records.
050926 101523  - segment 20050926073716-25: 1078 records.
050926 101524  - segment 20050926073716-26: 1412 records.
050926 101524  - segment 20050926073716-27: 4199 records.
050926 101524  - segment 20050926073716-28: 1741 records.
050926 101524  - segment 20050926073716-29: 3477 records.
050926 101524  - segment 20050926073716-3: 1853 records.
050926 101524  - segment 20050926073716-30: 1866 records.
050926 101524  - segment 20050926073716-31: 462 records.
050926 101524  - segment 20050926073716-32: 2728 records.
050926 101524  - segment 20050926073716-33: 1205 records.
050926 101524  - segment 20050926073716-34: 2244 records.
050926 101524  - segment 20050926073716-35: 1656 records.
050926 101524  - segment 20050926073716-36: 1527 records.
050926 101524  - segment 20050926073716-37: 2955 records.
050926 101524  - segment 20050926073716-38: 12739 records.
050926 101524  - segment 20050926073716-39: 530 records.
050926 101524  - segment 20050926073716-4: 2753 records.
050926 101524  - segment 20050926073716-40: 1759 records.
050926 101524  - segment 20050926073716-41: 2729 records.
050926 101524  - segment 20050926073716-42: 1050 records.
050926 101524  - segment 20050926073716-43: 3044 records.
050926 101524  - segment 20050926073716-44: 780 records.
050926 101524  - segment 20050926073716-45: 950 records.
050926 101524  - segment 20050926073716-46: 2530 records.
050926 101524  - segment 20050926073716-47: 585 records.
050926 101524  - segment 20050926073716-48: 5786 records.
050926 101524  - segment 20050926073716-49: 3371 records.
050926 101525  - segment 20050926073716-5: 4956 records.
050926 101525  - segment 20050926073716-6: 1332 records.
050926 101525  - segment 20050926073716-7: 1534 records.
050926 101525  - segment 20050926073716-8: 1970 records.
050926 101525  - segment 20050926073716-9: 3662 records.
050926 101525 * TOTAL 115996 input records in 50 segments.
050926 101525 * Creating master index...
050926 101550  Processed 20000 records (785.54596 rec/s)
050926 101610  Processed 40000 records (1009.795 rec/s)
050926 101627  Processed 60000 records (1144.4922 rec/s)
050926 101717  Processed 80000 records (405.50677 rec/s)
050926 101805  Processed 100000 records (417.52783 rec/s)
050926 101845 * Creating index took 200409 ms
050926 101845 * Optimizing index took 8 ms
050926 101845 * Removing duplicate entries...
050926 101846  Processed 20000 records (21739.13 rec/s)
050926 101847  Processed 40000 records (25477.707 rec/s)
050926 101848  Processed 60000 records (26281.209 rec/s)
050926 101849  Processed 80000 records (11363.637 rec/s)
050926 101850  Processed 100000 records (28943.56 rec/s)
050926 101850 * Deduplicating took 5368 ms
050926 101850 * Merging all segments into segments
050926 102420  Processed 20000 records (60.708897 rec/s)
050926 102844  Processed 40000 records (75.714554 rec/s)
050926 103337  Processed 60000 records (68.25682 rec/s)
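
To see how long each batch of 20000 records takes in the merge step, the timestamps in the log can be compared directly. The following is a minimal sketch, assuming the "YYMMDD HHMMSS" timestamp layout shown above; the helper name interval_rates is made up for illustration and is not part of Nutch.

```python
import re
from datetime import datetime

# Assumed layout: "YYMMDD HHMMSS <message>", as in the log excerpt above.
LINE = re.compile(r"^(\d{6} \d{6})\s+(.*)$")
PROCESSED = re.compile(r"Processed (\d+) records")


def interval_rates(lines):
    """Return (records_so_far, seconds_for_batch, rec_per_s) per progress line.

    A line without "Processed N records" is treated as the start of a phase
    and resets the counters. (Hypothetical helper, not a Nutch API.)
    """
    results = []
    prev_time = None
    prev_count = 0
    for line in lines:
        m = LINE.match(line)
        if not m:
            continue
        when = datetime.strptime(m.group(1), "%y%m%d %H%M%S")
        p = PROCESSED.search(m.group(2))
        if p is None:
            # Phase marker such as "* Merging all segments ...": restart.
            prev_time, prev_count = when, 0
            continue
        count = int(p.group(1))
        if prev_time is not None and when > prev_time:
            secs = (when - prev_time).total_seconds()
            results.append((count, secs, (count - prev_count) / secs))
        prev_time, prev_count = when, count
    return results


merge_log = [
    "050926 101850 * Merging all segments into segments",
    "050926 102420  Processed 20000 records (60.708897 rec/s)",
    "050926 102844  Processed 40000 records (75.714554 rec/s)",
    "050926 103337  Processed 60000 records (68.25682 rec/s)",
]

for count, secs, rate in interval_rates(merge_log):
    print(f"{count:>6} records: {secs:.0f} s for this batch, {rate:.1f} rec/s")
```

Because the timestamps have one-second resolution, the computed rates only approximate the logged ones. For the merge step they work out to 330 s, 264 s, and 293 s per batch, so the first 60000 records take about 15 minutes.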

Is there any way to prune webdb?

Gal Nitzan
Hi,

A few questions:

1. Is there any way to remove/prune unwanted URLs from the webdb without
deleting the whole webdb and then running updatedb?

2. After using prune, must I run updatedb to update the webdb?

3. Is there a way to remove unwanted records from the fetchlist?

4. Does generate use regex-urlfilter in the process?

5. I noticed the fetcher fetches pages in the fetchlist even though a rule
in regex-urlfilter should exclude them. How come?

Thanks,

Gal

Re: is there any way to prune webdb?

Tim Archambault
Did you get an answer to this? I'd also like to know how to remove URLs I no
longer want to crawl.

On 9/26/05, Gal Nitzan <[hidden email]> wrote:

>
> Hi,
>
> Few questions:
>
> 1. Is there any way to remove/prune unwanted url from webdb without
> deleting all webdb and than updatedb?
>
> 2. After using prune, must I use updatedb to update the webdb
>
> 3. Is there a way to remove unwanted records from fetchlist ?
>
> 4. Does generate use regex-urlfilter in the process?
>
> 5. I noticed fetcher fetches pages in the fetchlist though it should not
> because of a rule in the regex-urlfilter how come?
>
> Thanks,
>
> Gal
>