Quick questions - merging/deduping


Lucifersam
Hi,

I'm new to Nutch and am trying to get my head around some basics... I need to index two sites, one of which is under my control, into a single searchable index.

For the first site, the one under my control, I have run a complete 'seed' crawl and would like to update the index daily. To avoid recrawling the whole site I have set up a 'what's new/changed' page, which I want to crawl daily to pick up any changes. I then want to merge this with the complete crawl to produce an up-to-date index. (I tried the recrawl script from the wiki, but it didn't seem to do what I wanted.)

I have merged the two indexes in the following way (a rough shell transcript follows the list):

- created a new directory mergedcrawl
- copied seedcrawl/indexes/part-00000/* to mergedcrawl/indexes/part-00000
- copied changedcrawl/indexes/part-00000/* to mergedcrawl/indexes/part-00001
- ran 'bin/nutch dedup indexes' on mergedcrawl
- ran 'bin/nutch merge index indexes' on mergedcrawl
- copied segments/* from both crawls into mergedcrawl/segments
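
For reference, the steps above amount to roughly the following, run from the Nutch home directory (paths are mine, and I am assuming the 0.8.1 command syntax):

  mkdir -p mergedcrawl/indexes mergedcrawl/segments
  cp -r seedcrawl/indexes/part-00000    mergedcrawl/indexes/part-00000
  cp -r changedcrawl/indexes/part-00000 mergedcrawl/indexes/part-00001
  bin/nutch dedup mergedcrawl/indexes                   # remove duplicate documents
  bin/nutch merge mergedcrawl/index mergedcrawl/indexes # merge the parts into one index
  cp -r seedcrawl/segments/* changedcrawl/segments/* mergedcrawl/segments/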

With searcher.dir pointed at the new directory, the search seems to return results from both indexes successfully. Is this the correct way to do this?
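
(Concretely, I set it in nutch-site.xml with something like this - the path is illustrative:)

  <property>
    <name>searcher.dir</name>
    <value>/path/to/mergedcrawl</value>
  </property>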

The second site is not under my control, so I need to find an alternative way to keep its index up to date. Am I correct in thinking that simply recrawling the whole site is the easiest way to do this, or is there a way to index only modified pages?

Finally, I seem to have a problem with identical pages that have different URLs, e.g.:

http://website/
http://website/default.htm

I was under the impression that these would be removed by the dedup process, but this does not seem to be working. Is there something I'm missing? (I also have a similar problem with the external site, as it carries session IDs in its URLs, which change between visits, although the content of the duplicate pages is identical.)

Sorry for the long post - any help is appreciated!

(EDIT: I should have said - I'm using Nutch 0.8.1...)

Re: Quick questions - merging/deduping

Andrzej Białecki
Lucifersam wrote:
> Finally, I seem to have a problem with identical pages that have different
> URLs, e.g.:
>
> http://website/
> http://website/default.htm
>
> I was under the impression that these would be removed by the dedup process,
> but this does not seem to be working. Is there something I'm missing?

Most likely the pages are slightly different - you can save them to
files, and then run a diff utility to check for differences.
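
For example, something like (assuming wget or a similar tool is available):

  wget -O a.html http://website/
  wget -O b.html http://website/default.htm
  diff a.html b.html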


> (I also have a similar problem with the external site, as it carries
> session IDs in its URLs, which change between visits, although the content
> of the duplicate pages is identical.)

You can remove session IDs using URLNormalizers - see e.g.
regex-urlnormalizer.xml for an example of how to do this.
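
For instance, a rule along these lines in the regex normalizer configuration would strip a jsessionid-style parameter (the pattern and parameter names are only illustrative - match them to the actual session IDs the site uses):

  <regex>
    <!-- illustrative: strip ";jsessionid=..." / "?sid=..." style parameters -->
    <pattern>(?i)[;?&amp;](jsessionid|sessionid|sid)=[a-z0-9]*</pattern>
    <substitution></substitution>
  </regex>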

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Quick questions - merging/deduping

Lucifersam
Andrzej Bialecki wrote:
> Most likely the pages are slightly different - you can save them to
> files, and then run a diff utility to check for differences.

You're right, there was a small difference in the HTML: a timing comment, e.g.:

<!--Exec time = 265.625-->

As this is not strictly content, is there a simple way to ignore anything within comments when looking at the content of a page?

Andrzej Bialecki wrote:
> You can remove session IDs using URLNormalizers - see e.g.
> regex-urlnormalizer.xml for an example of how to do this.

Thanks - I will look into this.

Re: Quick questions - merging/deduping

Andrzej Białecki
Lucifersam wrote:
> You're right, there was a small difference in the HTML: a timing comment,
> e.g.:
>
> <!--Exec time = 265.625-->
>
> As this is not strictly content, is there a simple way to ignore anything
> within comments when looking at the content of a page?


You can provide your own implementation of a Signature - please see the
javadocs for this class - and then set this class in nutch-site.xml.

A common trick is to use just the plain-text version of the page and
further "normalize" it: replace every run of whitespace with a single
space, lowercase all tokens, optionally filter out all digits, and
optionally remove all words that occur only once.
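
Roughly, such a Signature could look like this (an untested sketch against the 0.8 API - the class name is made up, and it implements only the lowercase/digit/whitespace steps):

  import org.apache.hadoop.io.MD5Hash;
  import org.apache.nutch.crawl.Signature;
  import org.apache.nutch.parse.Parse;
  import org.apache.nutch.protocol.Content;

  public class NormalizedTextSignature extends Signature {
    public byte[] calculate(Content content, Parse parse) {
      // Work from the extracted plain text, so HTML comments and other
      // markup never influence the signature.
      String text = parse.getText()
          .toLowerCase()              // ignore case differences
          .replaceAll("[0-9]+", " ")  // optionally drop digits
          .replaceAll("\\s+", " ")    // collapse whitespace to single spaces
          .trim();
      return MD5Hash.digest(text).getDigest();
    }
  }

Then point Nutch at it in nutch-site.xml (the property is db.signature.class, if I remember the name correctly):

  <property>
    <name>db.signature.class</name>
    <value>NormalizedTextSignature</value>
  </property>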

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Quick questions - merging/deduping

Lucifersam

Andrzej Bialecki wrote:
> You can provide your own implementation of a Signature - please see the
> javadocs for this class - and then set this class in nutch-site.xml.
>
> A common trick is to use just the plain-text version of the page and
> further "normalize" it: replace every run of whitespace with a single
> space, lowercase all tokens, optionally filter out all digits, and
> optionally remove all words that occur only once.

Thanks for the suggestions - I will look into this.

Any comments/suggestions regarding the methods I am using to keep the index up to date?

Re: Quick questions - merging/deduping

Andrzej Białecki
Lucifersam wrote:
> Thanks for the suggestions - I will look into this.
>
> Any comments/suggestions regarding the methods I am using to keep the index
> up to date?

I can't see anything wrong with it on the conceptual level - in the end
the indexes are deduped and merged, so it should work fine.


--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com