Removing or reindexing a URL?


Removing or reindexing a URL?

Benjamin Higgins
Hello,

I'm trying to make Nutch suitable for our (extensive) intranet.  One
problem I'm trying to solve is how best to tell Nutch to either reindex or
remove a URL from the index.  I have a lot of pages that get changed, added
and removed daily, and I'd prefer to have the changes reflected in Nutch's
index immediately.

I am able to generate a list of URLs that have changed or have been removed,
so I definitely do not need to reindex everything; I just need a way to pass
this list on to Nutch.

How can I do this?

Ben

Re: Removing or reindexing a URL?

Stefan Groschupf-2
Just recrawl and reindex every day. That's the simple answer.
The more complex answer is that you need to write custom code that
deletes documents from your index and crawl db.
If you don't want to completely learn the internals of Nutch, just
recrawl and reindex. :)

Stefan
Am 06.06.2006 um 19:42 schrieb Benjamin Higgins:

> [...]
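
A minimal sketch of the "custom code" route Stefan mentions, for the index side
only: Nutch 0.7 bundles Lucene 1.4, whose IndexReader.delete(Term) call removes
every document matching a term. The index path, the file name and the assumption
that documents carry an exact-match "url" field are all illustrative (the stock
indexing filter may tokenize the URL, in which case pruning via the PruneIndexTool
with url: queries is likely the safer route), and removing the pages from the
webdb would additionally need something like IWebDBWriter.deletePage.

import java.io.BufferedReader;
import java.io.FileReader;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;

/** Sketch only: drop every URL listed in deleted_urls.txt from the Nutch index. */
public class DeleteUrlsFromIndex {
  public static void main(String[] args) throws Exception {
    IndexReader reader = IndexReader.open("mynutchdb/index");   // path is illustrative
    BufferedReader urls = new BufferedReader(new FileReader("deleted_urls.txt"));
    String url;
    while ((url = urls.readLine()) != null) {
      // delete(Term) marks every document whose "url" field equals the term
      // and returns how many documents were affected.
      int n = reader.delete(new Term("url", url));
      System.out.println(url + ": removed " + n + " document(s)");
    }
    urls.close();
    reader.close();   // deletions are flushed to disk when the reader is closed
  }
}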


Re: Removing or reindexing a URL?

Benjamin Higgins
In my tests, I index ~60k documents.  This process takes several hours.  I
plan on having about half a million documents indexed eventually, and I
suspect it'll take more than 24 hours to recrawl and reindex with my
hardware, so I'm concerned.

I *know* which documents I want to reindex or remove.  It's going to be a
very small subset compared to the whole group (I imagine around 1000
pages).  That's why I desperately want to be able to give Nutch a list of
documents.

Ben

On 6/8/06, Stefan Groschupf <[hidden email]> wrote:

> [...]

Re: Removing or reindexing a URL?

Stefan Groschupf-2
Ben,
you can remove pages from the index (prune tool) and maybe generate
a segment that contains just the updated or new URLs, then merge these
back into the index.
However, this is a manual / shell-script kind of thing and will take some
time to do each day.
Try to improve the configuration of your system: 40 pages/sec
fetching and 1000++ pages/sec indexing should be possible on a
"normal" box today.
Stefan

Am 09.06.2006 um 01:30 schrieb Benjamin Higgins:

> [...]
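
To illustrate the "merge these back into the index" step, here is a rough sketch
against the Lucene 1.4 API bundled with Nutch 0.7: index the small re-fetched
segment into its own directory, prune the stale copies of those URLs from the
main index, then fold the delta in. The paths are illustrative and Nutch's own
index/merge tools may already cover this, so treat it as an outline rather than
a recommended procedure.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

/** Sketch only: fold a small "delta" index into the main Nutch index. */
public class MergeDeltaIndex {
  public static void main(String[] args) throws Exception {
    // Open the existing main index for writing (create=false keeps its contents).
    IndexWriter writer =
        new IndexWriter("mynutchdb/index", new StandardAnalyzer(), false);
    // The analyzer is not used by addIndexes; segments are copied as-is.
    Directory delta = FSDirectory.getDirectory("mynutchdb/index-delta", false);
    writer.addIndexes(new Directory[] { delta });
    writer.optimize();   // optional: make sure everything ends up in one segment
    writer.close();
  }
}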


Re: Removing or reindexing a URL?

Howie Wang
In reply to this post by Benjamin Higgins
If you don't mind changing the source a little, I would change
the org.apache.nutch.db.WebDBInjector.java file so that
when you try to inject a URL that is already there, it will update
its next fetch date so that it will get fetched during the next
crawl.

In WebDBInjector.java in the addPage method, change:

  dbWriter.addPageIfNotPresent(page);

to:

  dbWriter.addPageWithScore(page);

Every day you can take your list of changed/deleted urls and do:

    bin/nutch inject mynutchdb/db -urlfile my_changed_urls.txt

Then do your crawl as usual. The updated pages will be refetched.
Nutch will also attempt to refetch the deleted pages, but they will
error out and be removed from the index.

You could also set your db.default.fetch.interval parameter to
longer than 30 days if you are sure you know what pages are changing.

Howie

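In other words, the behavioural difference Howie's one-line change relies on
looks roughly like this (a sketch of the relevant spot only; how the Page object
is built inside WebDBInjector.addPage is omitted here):

// Inside WebDBInjector.addPage(...) in Nutch 0.7 -- surrounding code omitted.
//
// dbWriter.addPageIfNotPresent(page);
//   Original call: if the URL is already in the webdb, the existing entry --
//   including its next-fetch date -- is left untouched, so re-injecting a
//   known URL changes nothing.
dbWriter.addPageWithScore(page);
//   Replacement call: the entry is (re)written from the freshly built Page,
//   so its next-fetch date starts over and the URL shows up in the next
//   generated fetchlist.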



Re: Removing or reindexing a URL?

Benjamin Higgins
Stefan, thank you.  I certainly do not mind writing a shell script or
changing some source.  This is all coming off one box, so I do worry that
I won't be able to fit a whole recrawl/reindex into one night once I expand
the crawl to all pages (most are dynamic/drawn from a db, and the box is a
little older).

Howie, thanks for this suggestion.  I'm assuming that addPageIfNotPresent
simply checks first (to make sure the page isn't present) and then calls
addPageWithScore.

I'll try what Howie describes and if that doesn't work out I'll write a
script that prunes then injects.

Thanks, I really do appreciate it!

Ben

On 6/8/06, Howie Wang <[hidden email]> wrote:

> [...]

Re: Removing or reindexing a URL?

Stefan Neufeind
In reply to this post by Howie Wang
How about making this a command-line option to inject? Could you create
an improvement patch?


Regards,
  Stefan

Howie Wang wrote:

> [...]
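
For reference, one way such an option could be wired in against the 0.7
WebDBInjector, as a rough sketch only -- the flag name (-replace) and the boolean
field are invented here for illustration and are not an existing Nutch option:

// Hypothetical fragment inside WebDBInjector (Nutch 0.7), not real Nutch code.

private boolean replaceExisting = false;   // set from a new command-line flag

// In the option-parsing loop of main(), recognize the extra flag:
//   if ("-replace".equals(args[i])) { replaceExisting = true; }

// In addPage(), pick the webdb call accordingly:
if (replaceExisting) {
  dbWriter.addPageWithScore(page);      // overwrite: next-fetch date is reset
} else {
  dbWriter.addPageIfNotPresent(page);   // default: known URLs stay untouched
}

An invocation would then look like "bin/nutch inject mynutchdb/db -urlfile
my_changed_urls.txt -replace" (again, a hypothetical flag).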

Re: Removing or reindexing a URL?

Howie Wang
Maybe I'll play around with it this weekend.

Howie





Re: Removing or reindexing a URL?

Stefan Neufeind
In reply to this post by Howie Wang
Hi,

It just came to my mind, just to make sure (I don't have the code at
hand): updatedb uses a different portion of code, right? Otherwise we
might re-crawl URLs we just fetched, because links to those URLs are
found during the db update :-)


Regards,
 Stefan

Howie Wang wrote:

> [...]

Re: Removing or reindexing a URL?

Howie Wang
They're separate pieces of code, so I think it should
be OK. WebDBInjector.java and UpdateDatabaseTool.java
do their own separate webdb manipulations.

Howie




Re: Removing or reindexing a URL?

Andrzej Białecki-2
In reply to this post by Stefan Neufeind
Stefan Neufeind wrote:
> How about making this a commandline-option to inject? Could you create
> an improvement-patch?

FWIW, a patch with similar functionality is in my work-in-progress
queue,  however it's for 0.8 - there is no point in backporting my patch
because the architecture is very different...

Here's a snippet:
....

Index: src/java/org/apache/nutch/crawl/Injector.java
===================================================================
--- src/java/org/apache/nutch/crawl/Injector.java    (revision 412602)
+++ src/java/org/apache/nutch/crawl/Injector.java    (working copy)
@@ -20,10 +20,11 @@
 import java.util.*;
 import java.util.logging.*;
 
+import org.apache.commons.logging.Log;
+import org.apache.commons.logging.LogFactory;
 import org.apache.hadoop.io.*;
 import org.apache.hadoop.fs.*;
 import org.apache.hadoop.conf.*;
-import org.apache.hadoop.util.LogFormatter;
 import org.apache.hadoop.mapred.*;
 
 import org.apache.nutch.net.*;
@@ -35,8 +36,8 @@
 /** This class takes a flat file of URLs and adds them to the of pages to be
  * crawled.  Useful for bootstrapping the system. */
 public class Injector extends Configured {
-  public static final Logger LOG =
-    LogFormatter.getLogger("org.apache.nutch.crawl.Injector");
+  public static final Log LOG =
+    LogFactory.getLog(Injector.class);
 
 
   /** Normalize and filter injected urls. */
@@ -46,7 +47,8 @@
     private float scoreInjected;
     private JobConf jobConf;
     private URLFilters filters;
-    private ScoringFilters scfilters;
+    private ScoringFilters scfilters;
+    private FetchSchedule schedule;
 
     public void configure(JobConf job) {
       this.jobConf = job;
@@ -55,6 +57,7 @@
       filters = new URLFilters(jobConf);
       scfilters = new ScoringFilters(jobConf);
       scoreInjected = jobConf.getFloat("db.score.injected", 1.0f);
+      schedule = FetchScheduleFactory.getFetchSchedule(job);
     }
 
     public void close() {}
@@ -69,17 +72,19 @@
         url = urlNormalizer.normalize(url);       // normalize the url
         url = filters.filter(url);             // filter the url
       } catch (Exception e) {
-        LOG.warning("Skipping " +url+":"+e);
+        LOG.warn("Skipping " +url+":"+e);
         url = null;
       }
       if (url != null) {                          // if it passes
         value.set(url);                           // collect it
-        CrawlDatum datum = new CrawlDatum(CrawlDatum.STATUS_DB_UNFETCHED, interval);
+        CrawlDatum datum = new CrawlDatum();
+        datum.setStatus(CrawlDatum.STATUS_INJECTED);
+        schedule.initializeSchedule(value, datum);
         datum.setScore(scoreInjected);
         try {
           scfilters.initialScore(value, datum);
         } catch (ScoringFilterException e) {
-          LOG.warning("Cannot filter init score for url " + url +
+          LOG.warn("Cannot filter init score for url " + url +
                   ", using default (" + e.getMessage() + ")");
           datum.setScore(scoreInjected);
         }
@@ -90,13 +95,87 @@
 
   /** Combine multiple new entries for a url. */
   public static class InjectReducer implements Reducer {
-    public void configure(JobConf job) {}
+    private static final int RESET_NONE     = 0x0000;
+    private static final int RESET_SCORE    = 0x0001;
+    private static final int RESET_SCHEDULE = 0x0002;
+    private static final int RESET_METADATA = 0x0004;
+    private static final int RESET_ALL      = 0x00ff;
+  
+    private static final int[] masks = {
+      RESET_NONE,
+      RESET_SCORE,
+      RESET_SCHEDULE,
+      RESET_METADATA,
+      RESET_ALL
+    };
+    private static final String[] maskNames = {
+      "none",
+      "score",
+      "schedule",
+      "metadata",
+      "all"
+    };
+  
+    private CrawlDatum injected, existing;
+    private int resetMode;
+    private FetchSchedule schedule;
+    private ScoringFilters scfilters;
+    private float scoreInjected;
+  
+    public void configure(JobConf job) {
+      String mode = job.get("db.injected.reset.mask", "none");
+      List names = Arrays.asList(mode.toLowerCase().split("\\s"));
+      resetMode = RESET_NONE;
+      for (int i = 0; i < maskNames.length; i++) {
+        if (names.contains(maskNames[i])) resetMode |= masks[i];
+      }
+      scfilters = new ScoringFilters(job);
+      scoreInjected = job.getFloat("db.score.injected", 1.0f);
+      schedule = FetchScheduleFactory.getFetchSchedule(job);
+    }
+  
     public void close() {}
 
     public void reduce(WritableComparable key, Iterator values,
                        OutputCollector output, Reporter reporter)
       throws IOException {
-      output.collect(key, (Writable)values.next()); // just collect first value
+      // there can be at most one value with status != STATUS_INJECTED
+      // and we also use only one value with status == STATUS_INJECTED
+      while (values.hasNext()) {
+        CrawlDatum datum = (CrawlDatum)values.next();
+        if (datum.getStatus() != CrawlDatum.STATUS_INJECTED) {
+          existing = datum;
+        } else {
+          injected = datum;
+        }
+      }
+      // set the status properly
+      if (injected != null) injected.setStatus(CrawlDatum.STATUS_DB_UNFETCHED);
+    
+      if (existing != null) {
+        if (injected == null) {
+          output.collect(key, existing);    // no update
+        } else {
+          // check if we need to reset some values in the existing copy
+          if ((resetMode & RESET_SCORE) != 0) {
+            try {
+              scfilters.initialScore((UTF8)key, existing);
+            } catch (Exception e) {
+              LOG.warn("Couldn't filter initial score, key " + key + ": " + e.getMessage());
+              existing.setScore(scoreInjected);
+            }
+          }
+          if ((resetMode & RESET_SCHEDULE) != 0) {
+            schedule.initializeSchedule((UTF8)key, existing);
+          }
+          if ((resetMode & RESET_METADATA) != 0) {
+            existing.setMetaData(new MapWritable());
+          }
+          output.collect(key, existing);
+        }
+      } else {
+        output.collect(key, injected);
+      }
     }
   }



--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
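
As a usage note on the snippet above: the reset behaviour is driven by the
"db.injected.reset.mask" property read in the patched InjectReducer.configure(),
and the recognized values are the ones listed in maskNames (none, score,
schedule, metadata, all); several can be combined, separated by whitespace.
A sketch of setting it programmatically against 0.8-dev follows -- the
NutchConfiguration helper is assumed, and in practice the property would more
likely just go into conf/nutch-site.xml:

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.util.NutchConfiguration;   // assumed 0.8-dev helper class

/** Sketch only: configure the reset mask the patch above introduces. */
public class ResetMaskExample {
  public static void main(String[] args) {
    Configuration conf = NutchConfiguration.create();
    // Re-initialize the fetch schedule and score of injected URLs that are
    // already present in the crawl db; leave their metadata alone.
    conf.set("db.injected.reset.mask", "schedule score");
    System.out.println(conf.get("db.injected.reset.mask"));
    // (For a normal "bin/nutch inject" run, the same property would be set in
    // conf/nutch-site.xml rather than in code.)
  }
}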



Re: Removing or reindexing a URL?

Howie Wang
Are enhancements being allowed into the 0.7 branch?

Howie




Re: Removing or reindexing a URL?

Stefan Neufeind
In reply to this post by Andrzej Białecki-2
Andrzej Bialecki wrote:

> [...]

I'm fine with 0.8(-dev). I have been using it successfully in production
myself now :-)

   Stefan