Re: [Nutch-dev] Creating a new scoring filter

Lorenzo-27
Hi,
sorry to re-open this thread, but I am facing the same problem as Nicolás.
I like both your (Doğacan's) and Nicolás' ideas, yours more, as I think
abstract classes are not good extension points.
Anyway, are any of these implemented? I really need it!
Also, I can't understand from the docs what it means that the adjust datum
will update the score of the original datum in updatedb.
Updated or adjusted in which way? I get strange values.

Thanks!

Lorenzo



> Hi,
> On 2/27/07, Nicolás Lichtmaier <nick@relo...> wrote:
> [snip]
> >
> > It doesn't seem a good way to do it. What if there are no outlinks? This
> > method won't be called at all. And anyway, it would be called once per
> > outlink, which would multiply the work.
> Multiplication is easy to solve but you are right that it won't work
> if there are no outlinks.
> Maybe the scoring filter API should change? A distributeScoreToOutlinks
> method (which would be called even if there are no outlinks) may be more
> useful than the current one:
> CrawlDatum distributeScoreToOutlinks(Text fromUrl, List<String>
> toUrlList, List<CrawlDatum> datumList, ParseData parseData,
> CrawlDatum adjust)
> This method gives more control to the plugin since knowing all the
> outlinks the plugin can make more informed decisions. Like, right now,
> there is no way a scoring filter can be sure that it has distributed
> all its cash (e.g if db.score.internal.link is 0.5 and
> db.score.external.link is 1.0, filter will almost always distribute
> less than its cash).
> This will also work for your case, since you will just ignore the
> outlinks and return the adjust datum based on information in parse
> metadata.
> What do you (and others) think?
> >
> > Thanks!
> >
> >
> --
> Doğacan Güney
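
To make the "distributes less than its cash" remark in the quoted proposal concrete, here is a minimal, self-contained sketch. It does not use any Nutch class: `CashDistribution`, `distribute` and `total` are hypothetical names, and the rule modelled here (split the page's cash evenly across the outlinks, then scale each share by the internal or external link factor) is an assumption about how such a filter behaves, not a copy of Nutch code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in, not a Nutch class: models how a page's "cash"
// is split across its outlinks when each share is scaled by a link factor.
class CashDistribution {

    // Assumed rule: every outlink gets an equal share of the cash, and the
    // share is then multiplied by the internal or external link factor.
    static List<Float> distribute(float cash, boolean[] isInternal,
                                  float internalFactor, float externalFactor) {
        List<Float> shares = new ArrayList<>();
        if (isInternal.length == 0) {
            return shares; // no outlinks: nothing is distributed at all
        }
        float equalShare = cash / isInternal.length;
        for (boolean internal : isInternal) {
            shares.add(equalShare * (internal ? internalFactor : externalFactor));
        }
        return shares;
    }

    // Sums up what was actually handed out.
    static float total(List<Float> shares) {
        float sum = 0f;
        for (float s : shares) {
            sum += s;
        }
        return sum;
    }

    public static void main(String[] args) {
        // Two internal and two external outlinks, factors 0.5 and 1.0.
        boolean[] links = {true, true, false, false};
        List<Float> shares = distribute(1.0f, links, 0.5f, 1.0f);
        System.out.println("shares = " + shares + ", total = " + total(shares));
    }
}
```

With a cash of 1.0 and these four links, the two internal links receive 0.125 each and the two external links 0.25 each, so only 0.75 is handed out. A filter that sees all outlinks at once could rescale the shares so that they sum to the full cash.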

Nicolás Lichtmaier

> sorry to re-open this thread, but I am facing the same problem of
> Nicolás.
> I like both yours (Doğacan) and Nicolas' ideas, more yours as I think
> abstract
> classes are not good extension points.

That wasn't what I had proposed. My suggestion was to use an interface,
as always, but make this API really clean, expressing the minimum the rest
of the code needs from a scoring plugin and removing assumptions about its
implementation. Then I proposed to have an abstract class,
implementing this interface, with a skeleton for any class which works by
"distributing score to outlinks". So we would have the best of both
worlds: people creating new "PageRank" algorithms wouldn't need to
reimplement anything, they would just subclass the abstract class. And
people like you and me would directly implement the interface (or use a
different abstract class if there's common logic to share). My boss put
all of this on hold, but I'd like to implement this idea in the near
future and try to have it included in Nutch.
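
The interface-plus-skeleton design described above could look roughly like the following sketch. Every name in it (`PageScorer`, `OutlinkDistributingScorer`, `EvenSplitScorer`, `ContentOnlyScorer`, `shareFor`) is hypothetical and chosen only for illustration; none of them is an existing Nutch type, and the real extension point would need Nutch's own data types.

```java
import java.util.List;

// All names below are hypothetical illustrations of the proposed design;
// none of them is an existing Nutch type.

// The minimal contract: what the rest of the crawler needs from a scorer.
interface PageScorer {
    // Returns the score passed on to each outlink. It may be called with
    // an empty list when the page has no outlinks.
    float[] scoreOutlinks(String fromUrl, List<String> outlinks, float pageScore);
}

// Skeleton for scorers that work by "distributing score to outlinks".
abstract class OutlinkDistributingScorer implements PageScorer {
    @Override
    public final float[] scoreOutlinks(String fromUrl, List<String> outlinks,
                                       float pageScore) {
        float[] result = new float[outlinks.size()];
        for (int i = 0; i < result.length; i++) {
            result[i] = shareFor(fromUrl, outlinks.get(i), pageScore, outlinks.size());
        }
        return result;
    }

    // The only piece a new link-based algorithm has to supply.
    protected abstract float shareFor(String fromUrl, String toUrl,
                                      float pageScore, int outlinkCount);
}

// A link-based scorer only subclasses the skeleton.
class EvenSplitScorer extends OutlinkDistributingScorer {
    @Override
    protected float shareFor(String fromUrl, String toUrl,
                             float pageScore, int outlinkCount) {
        return pageScore / outlinkCount;
    }
}

// A content-based scorer implements the interface directly and ignores links.
class ContentOnlyScorer implements PageScorer {
    @Override
    public float[] scoreOutlinks(String fromUrl, List<String> outlinks,
                                 float pageScore) {
        return new float[outlinks.size()]; // passes nothing on to outlinks
    }
}
```

The rest of the code would depend only on `PageScorer`, so the abstract class stays a convenience rather than the extension point itself.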


Doğacan Güney-3
In reply to this post by Lorenzo-27
On 4/19/07, Lorenzo <[hidden email]> wrote:
>
> Hi,
> sorry to re-open this thread, but I am facing the same problem of Nicolás.
> I like both yours (Doğacan) and Nicolas' ideas, more yours as I think
> abstract
> classes are not good extension points.
> Anyway, is any of these implemented? I really need it!


Well, I have implemented a subset of what we discussed in
NUTCH-468 <https://issues.apache.org/jira/browse/NUTCH-468>. There is a lot
more to be done but IMHO, NUTCH-468 may be a good starting point.

> Also, I can't understand from the docs what does it means that the
> adjust datum
> will update the score of the original datum in updatedb.
> Update or adjusted in which way? I obtain strange values..


In ScoringFilter.updateDbScore you get a list of inlinked datums that you
can use to change the score. Now, if in distributeScoreToOutlink(s) you return
a datum with a status of STATUS_LINKED, you will get this datum as one of the
inlinked datums in updateDbScore.

I hope this clears it up a bit.
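
The path of the adjust datum can be pictured with a small simulation. This is a sketch with simplified stand-ins, not Nutch code: `LinkedDatumFlow`, `Datum`, `Entry`, `groupInlinked` and the two status constants are hypothetical, and the grouping rule (every linked-status datum recorded for a URL ends up in that URL's inlinked list) is an assumption based on the explanation above. The URL `http://foo/other.htm` is made up for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified stand-ins for illustration only; these are not Nutch classes.
class LinkedDatumFlow {
    static final int STATUS_DB = 0;      // hypothetical: entry already in the db
    static final int STATUS_LINKED = 1;  // hypothetical: entry produced by a link

    static class Datum {
        final int status;
        final float score;
        Datum(int status, float score) { this.status = status; this.score = score; }
    }

    static class Entry {
        final String url;
        final Datum datum;
        Entry(String url, Datum datum) { this.url = url; this.datum = datum; }
    }

    // Assumed behaviour of the update step: entries are grouped by URL and
    // every STATUS_LINKED datum for a URL ends up in that URL's inlinked list.
    static Map<String, List<Datum>> groupInlinked(List<Entry> entries) {
        Map<String, List<Datum>> inlinkedByUrl = new LinkedHashMap<>();
        for (Entry e : entries) {
            if (e.datum.status == STATUS_LINKED) {
                inlinkedByUrl.computeIfAbsent(e.url, k -> new ArrayList<>()).add(e.datum);
            }
        }
        return inlinkedByUrl;
    }

    public static void main(String[] args) {
        List<Entry> entries = new ArrayList<>();
        // The page itself, as stored in the db.
        entries.add(new Entry("http://foo/bar.htm", new Datum(STATUS_DB, 1.0f)));
        // An ordinary outlink from the page to another URL.
        entries.add(new Entry("http://foo/other.htm", new Datum(STATUS_LINKED, 0.5f)));
        // The adjust datum: keyed by the page's own URL, with linked status.
        entries.add(new Entry("http://foo/bar.htm", new Datum(STATUS_LINKED, 2.0f)));

        Map<String, List<Datum>> inlinked = groupInlinked(entries);
        System.out.println(inlinked.get("http://foo/bar.htm").size());
    }
}
```

In this model the adjust datum is the only linked-status entry recorded under the page's own URL, which is why it shows up among that page's inlinked datums.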




--
Doğacan Güney

Lorenzo-27
In reply to this post by Nicolás Lichtmaier
Sorry, I misunderstood your intentions.
Now I can see the advantages of your approach: a developer has to
implement the whole interface only if he/she needs to have more control
over some features.
This sounds great to me!

Lorenzo




Lorenzo-27
In reply to this post by Doğacan Güney-3
Doğacan Güney wrote:

> In ScoringFilter.updateDbScore you get a list of inlinked datums that you
> can use to change the score. Now, if in distributeScoreToOutlink(s) you
> return a datum with a status of STATUS_LINKED, you will get this datum
> as one of the inlinked datums in updateDbScore.
>
> I hope this clears it up a bit.
>
Uhmm... so, suppose I decided, from its content, that the current page
http://foo/bar.htm is really desirable.
I have put a flag in ParseData's metadata to mark it.
In distributeScoreToOutlink(s) I read it from the ParseData param, and
put it in the adjust CrawlDatum's metadata:

      MapWritable adjustMap = adjust.getMetaData();
      adjustMap.put(key, new FloatWritable(boostValue));
      return adjust;

So in updateDbScore(Text url, CrawlDatum old, CrawlDatum datum, List
inlinked)
the adjust CrawlDatum will be in the inlinked list. Is that right? How
do I distinguish it?
I can put the URL in metadata too, and scroll through the list, but
maybe there is a better method?

Also, will this CrawlDatum be the same one that is passed to indexerScore?
Thanks a lot!

Lorenzo


Doğacan Güney-3
On 4/21/07, Lorenzo <[hidden email]> wrote:

>
> Uhmm... so, suppose I decided, from its content, that the current page
> http://foo/bar.htm is really desirable.
> I have put a flag in ParseData's metadata to mark it.
> In distributeScoreToOutlink(s) I read it from the ParseData param, and
> put it in the adjust CrawlDatum's metadata:
>
>       MapWritable adjustMap = adjust.getMetaData();
>       adjustMap.put(key, new FloatWritable(boostValue));
>       return adjust;
>
> So in updateDbScore(Text url, CrawlDatum old, CrawlDatum datum, List
> inlinked)
> the adjust CrawlDatum will be in the inlinked list. Is that right? How
> do I distinguish it?
> I can put the URL in metadata too, and scroll through the list, but
> maybe there is a better method?



Your approach is the best one: put a flag in the adjust datum's metadata to
mark it, then process it in updateDbScore.
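
The flag-in-metadata approach could be sketched as follows. The sketch uses simplified stand-ins rather than Nutch classes: `FlaggedAdjustDatum`, `Datum`, the `content.boost` key, and the way the boost is combined with the inlink scores (a multiplier) are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified stand-ins for illustration only; these are not Nutch classes.
class FlaggedAdjustDatum {
    // Hypothetical metadata key used to mark (and carry) the content boost.
    static final String BOOST_KEY = "content.boost";

    static class Datum {
        float score;
        final Map<String, Float> metaData = new HashMap<>();
        Datum(float score) { this.score = score; }
    }

    // Sketch of an updateDbScore-like step: ordinary inlinks add their score,
    // while a datum carrying BOOST_KEY is treated as the adjust datum and its
    // boost is applied as a multiplier (an arbitrary choice for this sketch).
    static void updateDbScore(Datum old, Datum datum, List<Datum> inlinked) {
        float score = (old != null) ? old.score : 0f;
        float boost = 1f;
        for (Datum in : inlinked) {
            Float flagged = in.metaData.get(BOOST_KEY);
            if (flagged != null) {
                boost = flagged;     // the adjust datum, recognised by its flag
            } else {
                score += in.score;   // a regular inlink contribution
            }
        }
        datum.score = score * boost;
    }

    public static void main(String[] args) {
        Datum old = new Datum(1.0f);
        Datum result = new Datum(0f);

        Datum inlink = new Datum(0.5f);
        Datum adjust = new Datum(0f);
        adjust.metaData.put(BOOST_KEY, 2.0f);

        List<Datum> inlinked = new ArrayList<>();
        inlinked.add(inlink);
        inlinked.add(adjust);

        updateDbScore(old, result, inlinked);
        System.out.println(result.score);
    }
}
```

Because the adjust datum is recognised by its key rather than by its position, the result does not depend on where it appears in the inlinked list, and no URL has to be stored in the metadata.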

> Also, this CrawlDatum will be the same that is passed to indexerScore?


You get two CrawlDatums in indexerScore. The first is fetchDatum, which is the
one in crawl_fetch that contains the fetching status. The second is dbDatum,
which comes from crawldb. This dbDatum is the one that you set in
updateDbScore (the 'datum' argument of updateDbScore).
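
As a rough illustration of which datum matters at indexing time, the sketch below uses stand-in types, not Nutch classes. `IndexerScoreSketch`, its `Datum` class and the reduced three-argument `indexerScore` are hypothetical, and multiplying the stored score by the initial score is just one plausible choice.

```java
// Simplified stand-ins for illustration only; these are not Nutch classes.
class IndexerScoreSketch {
    static class Datum {
        final String origin; // "crawl_fetch" or "crawldb", for illustration
        final float score;
        Datum(String origin, float score) { this.origin = origin; this.score = score; }
    }

    // Sketch of an indexerScore-like step. The final boost is taken from
    // dbDatum, because that is where the updateDbScore step stored its result;
    // fetchDatum only reports how the fetch went and is ignored here.
    static float indexerScore(Datum dbDatum, Datum fetchDatum, float initScore) {
        return dbDatum.score * initScore;
    }

    public static void main(String[] args) {
        Datum dbDatum = new Datum("crawldb", 3.0f);
        Datum fetchDatum = new Datum("crawl_fetch", 1.0f);
        System.out.println(indexerScore(dbDatum, fetchDatum, 2.0f));
    }
}
```

So a boost applied in updateDbScore carries over to indexing through dbDatum, without any extra bookkeeping on fetchDatum.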




--
Doğacan Güney

Lorenzo-27
Perfect! Now I have it working, and it performs quite well for a focused
search engine like ours!
Do you think it could be an interesting plug-in to add to Nutch?

Lorenzo




Briggs
Yes.  I too need to alter the score based on attributes and such of
the particular url passed.
May I ask what you have done?




--
"Conscious decisions by concious minds are what make reality real"

Lorenzo-27
Very briefly, with an HtmlParseFilter and a list of weighted words.
This filter examines the parse text and adds a boost value if it finds
one of the words in the list.
This boost value is added to the ParseData metadata.
Then, a scoring plugin reads this metadata (passScoreAfterParsing) and
updates the CrawlDatum, both of the outlinked pages (to focus the search more)
and of the current page (the difficult part, as explained on the mailing list;
however, with NUTCH-468 it should be easier now).
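
The weighted-word step described above might be sketched like this. `WeightedWordBoost`, `add` and `boostFor` are hypothetical names, not part of Nutch, and the rule used here (each listed word counts once, and the weights of the words found are summed) is only one possible reading of the description.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of Nutch: computes a boost from a list of
// weighted words, as a parse filter could do before storing the value in
// the parse metadata.
class WeightedWordBoost {
    private final Map<String, Float> weights = new LinkedHashMap<>();

    // Registers a word (case-insensitively) with its weight.
    WeightedWordBoost add(String word, float weight) {
        weights.put(word.toLowerCase(Locale.ROOT), weight);
        return this;
    }

    // Adds the weight of every listed word that occurs at least once in the
    // text. Counting each word once is a choice made for this sketch.
    float boostFor(String text) {
        Set<String> tokens = new HashSet<>();
        // Split on anything that is not a letter or a digit.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        float boost = 0f;
        for (Map.Entry<String, Float> e : weights.entrySet()) {
            if (tokens.contains(e.getKey())) {
                boost += e.getValue();
            }
        }
        return boost;
    }

    public static void main(String[] args) {
        WeightedWordBoost words = new WeightedWordBoost()
                .add("nutch", 2.0f)
                .add("crawler", 0.5f);
        System.out.println(words.boostFor("Nutch is a web crawler. Nutch!"));
    }
}
```

In a real plugin the returned value would be written to the parse metadata by the parse filter and read back later by the scoring filter.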

If you need more information, please ask!

Lorenzo

