focused crawls -- where to add parse filter


focused crawls -- where to add parse filter

Brian Whitman
In doing whole-internet focused crawls we'd like a parse/injector  
filter.

Say we only want pages in our nutch db and index that have the word
"nutch" in them. I'd like to express the rule as a Lucene boolean
query, contents:nutch, because in our real-world scenario the match is
fuzzier and involves many phrases and terms. It's not just a regular
expression.

If the query does not match or matches under a threshold score, I  
don't want to add the fetched/parsed document to the index, nor (more  
importantly) have the generator find outlinks from that page for  
future crawls.

This is somewhat like a URL filter, but instead of filtering by the
URL I want to filter by the parsed page content.

Where would I add this in nutch?

-Brian






Re: focused crawls -- where to add parse filter

Dennis Kubes
You can use an HtmlParseFilter and then set a metadata attribute as to
whether or not the page contains the phrase.  The problem with this is
that all of the content is still stored.  You could also change the
ParseOutputFormat to write out a record only if the word is contained,
although that is a bit of a hack.
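
A rough sketch of such a filter (the exact filter() signature and
metadata accessors differ between Nutch versions, and the hard-coded
phrase and the "focus.match" key are only placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.parse.HTMLMetaTags;
import org.apache.nutch.parse.HtmlParseFilter;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.protocol.Content;
import org.w3c.dom.DocumentFragment;

public class FocusParseFilter implements HtmlParseFilter {

  // Placeholder; in practice this would come from configuration.
  private static final String PHRASE = "nutch";

  private Configuration conf;

  public Parse filter(Content content, Parse parse,
                      HTMLMetaTags metaTags, DocumentFragment doc) {
    String text = parse.getText();
    if (text != null && text.toLowerCase().indexOf(PHRASE) != -1) {
      // Flag the page in the parse metadata so a later job can see it.
      parse.getData().getParseMeta().set("focus.match", "true");
    }
    return parse;
  }

  public void setConf(Configuration conf) { this.conf = conf; }
  public Configuration getConf() { return conf; }
}

The plugin would also need the usual plugin.xml/build.xml boilerplate
and an entry in plugin.includes.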

This may be an area where we need to add an extension point, if one
doesn't already exist.  I am sure there are many more people out there
who would like to selectively store pages based on their content.

Dennis Kubes

Brian Whitman wrote:

> In doing whole-internet focused crawls we'd like a parse/injector filter.
>
> Say we only want pages in our nutch db and index that have the word
> "nutch" in them. I'd like to express the rule as a lucene boolean query,
> contents:nutch, because in our real world scenario the match is more
> fuzzy and involves many phrases and terms. It's not just a regular
> expression.
>
> If the query does not match or matches under a threshold score, I don't
> want to add the fetched/parsed document to the index, nor (more
> importantly) have the generator find outlinks from that page for future
> crawls.
>
> This is somewhat like a url filter, but instead of filtering by url
> content I want to filter by parsed page content.
>
> Where would I add this in nutch?
>
> -Brian

Re: focused crawls -- where to add parse filter

Brian Whitman
On Feb 17, 2007, at 12:58 PM, Dennis Kubes wrote:

> You can use an HtmlParseFilter and then set a metadata attribute as  
> to whether or not it contains the phrase.  Problem with this is  
> that all of the content is still stored.  You could also change the  
> ParseOutputFormat to only write out if the word is contained  
> although that is a bit of a hack.

I'm not worried about a hack, our whole setup is very "Der Lauf der
Dinge" and one more plank won't matter much :) But after sending my
question out, I realized that I would need to index the document
anyway before being able to run a Lucene query against it for
topicality. I don't mind having pages stored that don't match my
query, but I really would rather the generator not get more outlinks
from those pages.

So a simple fix would be something I can write or run after a
crawl/index cycle to mark certain pages so they don't emit more URIs
in the generator. It would query each page in an index and update some
flag. But what is that flag, and how can I get at it?

And more advanced, and later on -- the generator has smarts to
prioritize fetching by inlink counts -- is there something I can hack
to "boost" outlink fetches based on the source page's content? For
example, I find a page that scores high on my Lucene query after the
crawl/index gets done. I would want the generator to put all of its
outlinks up top, even if there aren't many inlinks to that page...
would this be a "generator plugin"?

-Brian







Re: focused crawls -- where to add parse filter

Dennis Kubes
If I understand what you are trying to do then here is how I would
approach it.

Write an HtmlParseFilter that sets an attribute in the ParseData
MetaData based on whether the page contains what you are looking for.
Then write another MR job that runs after the crawl/index cycle.  This
job would update the CrawlDatum MetaData based on your priority
calculation (inlinks, contains text, etc.).  Then hack the Generator
class around line 160 to change the sort value it uses, based on the
CrawlDatum MetaData.  I would make the new sort value an option that
you can turn on and off through configuration.
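
The heart of that post-crawl job's map function might look roughly
like this (old mapred API; the exact Mapper signature depends on your
Hadoop version, and the "focus.boost" key, the score, and matchedUrls
are made-up placeholders):

public void map(WritableComparable key, Writable value,
                OutputCollector output, Reporter reporter)
    throws IOException {
  CrawlDatum datum = (CrawlDatum) value;
  // matchedUrls stands in for whatever holds the URLs your priority
  // calculation selected.
  if (matchedUrls.contains(key.toString())) {
    // Store the priority so the patched Generator (or a scoring filter)
    // can sort on it later.
    datum.getMetaData().put(new Text("focus.boost"), new FloatWritable(10.0f));
  }
  output.collect(key, datum);
}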

Hope this helps.

Dennis Kubes

Brian Whitman wrote:

> On Feb 17, 2007, at 12:58 PM, Dennis Kubes wrote:
>
>> You can use an HtmlParseFilter and then set a metadata attribute as to
>> whether or not it contains the phrase.  Problem with this is that all
>> of the content is still stored.  You could also change the
>> ParseOutputFormat to only write out if the word is contained although
>> that is a bit of a hack.
>
> I'm not worried about a hack, our whole set up is very "der lauf der
> dinge" and one more plank won't matter much :) But after sending my
> question out, I realized that I would need to index the document anyway
> before being able to lucene query it for topicality. I don't mind having
> pages stored that don't match my query, but I really would rather the
> generator not get more outlinks from those pages.
>
> So a simple fix would be something I can write or run after a
> crawl/index cycle that can mark certain pages to not emit more URIs in
> the generator. It would query each page in an index and update some
> flag. But what is that flag and how can I get at it?
>
> And more advanced and later on -- the generator has smarts to prioritize
> fetching by inlink counts-- is there something I can hack to "boost"
> outlink fetches based on the source page's content?  for example - I
> find a page that scores high on my lucene query after crawl/index gets
> done. I would want the generator to put all of its outlinks up top, even
> if there's not many inlinks to that page... would this be a "generator
> plugin?"
>
> -Brian

Re: focused crawls -- where to add parse filter

Doğacan Güney
In reply to this post by Brian Whitman
Hi,

On 2/17/07, Brian Whitman <[hidden email]> wrote:
>
> I'm not worried about a hack, our whole set up is very "der lauf der
> dinge" and one more plank won't matter much :) But after sending my
> question out, I realized that I would need to index the document
> anyway before being able to lucene query it for topicality. I don't
> mind having pages stored that don't match my query, but I really
> would rather the generator not get more outlinks from those pages.

How about an outlink filter that works during parse? In ParseOutputFormat,
it would take the parse text, parse data (etc.) of the source page and
the destination URL, then either return "filter this outlink" or
"let it through".

>
> So a simple fix would be something I can write or run after a crawl/
> index cycle that can mark certain pages to not emit more URIs in the
> generator. It would query each page in an index and update some flag.
> But what is that flag and how can I get at it?
>
> And more advanced and later on -- the generator has smarts to
> prioritize fetching by inlink counts-- is there something I can hack
> to "boost" outlink fetches based on the source page's content?  for
> example - I find a page that scores high on my lucene query after
> crawl/index gets done. I would want the generator to put all of its
> outlinks up top, even if there's not many inlinks to that page...
> would this be a "generator plugin?"

You should be able to do this with a scoring plugin and a parse plugin.

Write a parse plugin (or update a current one) to analyze the content
and put the result in the parse data's metadata (for example, put a
<"boost", "10"> pair in it). Then, in
<your_scoring_filter>.distributeScoreToOutlink, check whether the parse
data's metadata has the "boost" field and boost accordingly. You may
also want to consider changing the indexerScore method to give it an
even higher boost.
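
The relevant method of such a scoring filter might look roughly like
this (the signature shown is the 0.8/0.9-era ScoringFilter one and may
differ in your version; the "boost" key follows the convention above,
and the other ScoringFilter methods would just pass values through):

public CrawlDatum distributeScoreToOutlink(Text fromUrl, Text toUrl,
    ParseData parseData, CrawlDatum target, CrawlDatum adjust,
    int allCount, int validCount) throws ScoringFilterException {
  String boost = parseData.getParseMeta().get("boost");
  if (boost != null) {
    // Outlinks of pages the parse plugin liked start with a higher score.
    target.setScore(target.getScore() * Float.parseFloat(boost));
  }
  return adjust;
}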

>
> -Brian
>


--
Doğacan Güney

Re: focused crawls -- where to add parse filter

Brian Whitman

> How about an outlink filter that works during parse? In  
> ParseOutputFormat,
> it will take the parse text, parse data (etc.) of the source page and
> the destination url then will either return "filter this outlink" or
> "let it through".

> Write an HtmlParseFilter that sets an attribute in the ParseData  
> MetaData based on whether the page contains what you are looking  
> for. Then write another MR job that runs after the crawl/index  
> cycle.  This job would need to update the CrawlDatum MetaData based  
> on your priority calculation (inlinks and contains text, etc.).  
> Then hack the Generator class around line 160 to change the sort  
> value that it is using based on the CrawlDatum MetaData.  I would  
> make using this new sort value an option that you can turn on and  
> off by using different configuration values.

Hi Doğacan, Dennis:

Thanks for the ideas. I spent some time mentally planning out how to
implement both of these ideas by looking at the source. I'm still
newish to Nutch, so excuse my naiveté.

Do either of these approaches let me get at the analyzed/indexed
contents of the page text so that I can perform Lucene queries for
filtering? What I could tell of the HtmlParseFilter, or Parse in
general, is that it gets me the parse tree, which I could run regexp
queries on -- but I'd rather it all be in Lucene and be influenced by
the relative ranking of terms among all documents. I am envisioning
machine-generated queries from our classifiers that might be hundreds
of tokens long, with boost values per term and a score threshold. So
I'd need to act on the documents post-index. Unless I'm reading your
suggestions incorrectly, neither of them lets me do that?


I am currently looking at PruneIndexTool -- could a modification of
this work? I could run it after a crawl/index cycle but before
invertlinks and the next generate. The one issue I see is that
PruneIndexTool claims not to affect the WebDB. Does this mean that
even though the Lucene doc will be gone, the link and outlinks will
remain in the WebDB and will be fetched anyway?

If I should instead be looking harder at your recommended  
HtmlParseFilter or ParseOutputFormat, please correct me.

-Brian


Re: focused crawls -- where to add parse filter

Dennis Kubes
Brian Whitman wrote:

>
>> How about an outlink filter that works during parse? In
>> ParseOutputFormat,
>> it will take the parse text, parse data (etc.) of the source page and
>> the destination url then will either return "filter this outlink" or
>> "let it through".
>
>> Write an HtmlParseFilter that sets an attribute in the ParseData
>> MetaData based on whether the page contains what you are looking for.
>> Then write another MR job that runs after the crawl/index cycle.  This
>> job would need to update the CrawlDatum MetaData based on your
>> priority calculation (inlinks and contains text, etc.).  Then hack the
>> Generator class around line 160 to change the sort value that it is
>> using based on the CrawlDatum MetaData.  I would make using this new
>> sort value an option that you can turn on and off by using different
>> configuration values.
>
> Hi Doğacan, Dennis:
>
> Thanks for the ideas. I spent some time mentally planning out how to
> implement both of these ideas by looking at the source. I'm still newish
> to Nutch so excuse my naiveté.
>
> Do either of these approaches let me get at the analyzed/indexed
> contents of the page text so that I can perform Lucene queries for
> filtering? What I could tell of the HtmlParseFilter or Parse in general
> is that it gets me at the parse tree, which i could do regexp queries on
> -- but I'd rather it be all in Lucene and be influenced by the relative
> ranking of terms amongst all documents. I am envisioning machine
> generated queries from our classifiers that might be hundreds of tokens
> long with boost values per term, and a score threshold. So I'd need to
> act on the documents post-index. Unless I'm reading your suggestions
> incorrectly, neither of them let me at that?

You could drop the HtmlParseFilter part and simply write the
post-crawl/index MR job to update the CrawlDatum based on your Lucene
queries.  You would still need the second part that does generation
based on a different sort value.
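
A very rough sketch of how that job could pick the URLs to keep, using
the Lucene 2.x-era API (the index path, field names, query, and the
0.5 score threshold are all placeholders; exception handling omitted):

IndexSearcher searcher = new IndexSearcher("crawl/indexes/part-00000");
QueryParser parser = new QueryParser("content", new StandardAnalyzer());
Query query = parser.parse("nutch^2 crawler lucene");  // machine-generated query
Hits hits = searcher.search(query);

Set<String> keep = new HashSet<String>();
for (int i = 0; i < hits.length(); i++) {
  if (hits.score(i) >= 0.5f) {
    // Nutch indexes the page URL in the "url" field.
    keep.add(hits.doc(i).get("url"));
  }
}
searcher.close();

// A follow-up pass over the crawldb would then flag every CrawlDatum
// whose URL is not in "keep", e.g. by writing a marker into its metadata.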
>
> I am currently looking at PruneIndexTool -- could a modification of this
> work? I could run it after a crawl/index cycle but before invertlinks
> and the next generate. The one issue I see is that PruneIndexTool claims
> not to affect the WebDB. Does this mean that even though the lucene doc
> will be gone, the link and outlinks will remain in the WebDB and will be
> fetched anyway?

That is correct.  You will need to alter the CrawlDb to affect what is
generated and hence fetched.
>
> If I should instead be looking harder at your recommended
> HtmlParseFilter or ParseOutputFormat, please correct me.

No, if you are doing complex queries rather than something like "if
this page contains words x, y, and z", then I wouldn't do it through
an HtmlParseFilter; I would probably go with the post-index Lucene
approach.

Dennis Kubes
>
> -Brian
>

Re: focused crawls -- where to add parse filter

Doğacan Güney
On 2/19/07, Dennis Kubes <[hidden email]> wrote:

[snip]

>
> You could drop the HtmlParseFilter part and simply write the post
> crawl/index MR job after to update the CrawlDatum based on your lucene
> queries.  You would still need to write the second part that does the
> generation based on a different sort value.

The second part can be written with a different scoring plugin. Simply
put whatever it is you need in the CrawlDatum's metadata, then change
ScoringFilter.generatorSortValue to look up that value and return a
good/bad score.
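
Something along these lines -- again only a sketch; the signature is
the 0.8/0.9-era one, and the "focus.boost" key is just the convention
used earlier in this thread:

public float generatorSortValue(Text url, CrawlDatum datum, float initSort)
    throws ScoringFilterException {
  // If an earlier job stored a boost for this URL, fold it into the sort value.
  Writable w = datum.getMetaData().get(new Text("focus.boost"));
  if (w instanceof FloatWritable) {
    return initSort * ((FloatWritable) w).get();
  }
  return initSort;
}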

[snip]

--
Doğacan Güney