[jira] Created: (NUTCH-87) Efficient site-specific crawling for a large number of sites


[jira] Created: (NUTCH-87) Efficient site-specific crawling for a large number of sites

Steve Loughran (Jira)
Efficient site-specific crawling for a large number of sites
------------------------------------------------------------

         Key: NUTCH-87
         URL: http://issues.apache.org/jira/browse/NUTCH-87
     Project: Nutch
        Type: New Feature
  Components: fetcher  
 Environment: cross-platform
 Reporter: AJ Chen


There is a gap between whole-web crawling and crawling a single site (or a handful of sites). Many applications actually fall in this gap: they typically require crawling a large number of selected sites, say 100,000 domains. The current CrawlTool is designed for a handful of sites, so this request calls for a new feature or an improvement to CrawlTool so that the "nutch crawl" command can efficiently deal with a large number of sites. One requirement is to add or change the smallest amount of code possible, so that this feature can be implemented sooner rather than later.

There has been some discussion about adding a URLFilter to implement this requested feature; see the following thread:
http://www.mail-archive.com/nutch-dev%40lucene.apache.org/msg00726.html
The idea is to use a hashtable in the URLFilter to look up the regexes for any given domain. A hashtable lookup will be much faster than the list implementation currently used in RegexURLFilter. Fortunately, Matt Kangas has implemented this idea before for his own application and is willing to make it available for adaptation to Nutch. I'll be happy to help him in this regard.
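To make the idea concrete, here is a minimal, hypothetical sketch of the kind of per-domain hash lookup being proposed. The class and method names are invented for illustration; a real plugin would implement Nutch's URLFilter extension point and load its rules from a file:

import java.net.URL;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Sketch only: per-domain regex lookup via a hash map.
public class DomainHashFilter {

    // host -> pattern that URLs on that host must match to be accepted
    private final Map<String, Pattern> rulesByHost = new HashMap<String, Pattern>();

    public void addRule(String host, String regex) {
        rulesByHost.put(host.toLowerCase(), Pattern.compile(regex));
    }

    // URLFilter convention: return the URL if accepted, null if rejected.
    public String filter(String urlString) {
        try {
            String host = new URL(urlString).getHost().toLowerCase();
            // O(1) expected lookup, instead of scanning every regex
            // the way RegexURLFilter does.
            Pattern p = rulesByHost.get(host);
            return (p != null && p.matcher(urlString).find()) ? urlString : null;
        } catch (Exception e) {
            return null; // malformed URL: reject
        }
    }

    public static void main(String[] args) {
        DomainHashFilter f = new DomainHashFilter();
        f.addRule("www.example.com", "^http://www\\.example\\.com/");
        System.out.println(f.filter("http://www.example.com/a.html")); // accepted
        System.out.println(f.filter("http://other.org/"));             // null
    }
}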

But before we do it, we would like to hear more discussion or comments about this approach or other approaches. In particular, let us know what the potential downsides of a hashtable lookup in a new URLFilter plugin would be.

AJ Chen



--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira


[jira] Updated: (NUTCH-87) Efficient site-specific crawling for a large number of sites

Steve Loughran (Jira)
     [ http://issues.apache.org/jira/browse/NUTCH-87?page=all ]

Matt Kangas updated NUTCH-87:
-----------------------------

    Attachment: JIRA-87-whitelistfilter.tar.gz

The attached tarball contains the source for two Java classes:

epile.crawl.plugin.whitelisturlfilter.WhitelistURLFilter.java
epile.crawl.whitelist.WhitelistWriter.java

License granted for inclusion in ASF works by Team Gigabyte, Inc.

Note that there is one known logic bug in WhitelistWriter, which I'll be fixing in my own codebase shortly. Updates will be posted here. :^)



Nutch-87 Setup

Michael Ji
hi Matt:

Your NUTCH-87 patch is a good idea, and I believe it provides a solution for a good-sized set of controlled domains, say hundreds of thousands of sites.

I am currently trying to implement it on Nutch 0.7.

I have several questions I'd like clarified:

1)
Should I create two plug-in classes in Nutch, i.e.
one for "WhitelistURLFilter" and
one for "WhitelistWriter"?

2)
I found that Whitelist.java refers to
"import epile.util.LogLevel;"

and WhitelistURLFilter.java refers to
"import epile.crawl.util.StringURL;
import epile.util.LogLevel;"

Do these packages exist in the Nutch lib? If not,
should we import a new epile*.jar?

3)
If we want to use NUTCH-87, do we need to change the
Nutch core code?

I plan to replace all the places where RegexURLFilter
appears with WhitelistURLFilter.

Is that the right approach?

thanks,

Michael Ji



       
               

[jira] Commented: (NUTCH-87) Efficient site-specific crawling for a large number of sites

Steve Loughran (Jira)
In reply to this post by Steve Loughran (Jira)
    [ http://issues.apache.org/jira/browse/NUTCH-87?page=comments#action_12323157 ]

Matt Kangas commented on NUTCH-87:
----------------------------------

Sample edits to nutch-site.xml for use with this plugin:


<property>
  <name>epile.crawl.whitelist.enableUndirectedCrawl</name>
  <value>false</value>
</property>

<property>
  <name>urlfilter.whitelist.file</name>
  <value>/var/epile/crawl/whitelist_map</value>
  <description>Name of file containing the location of the on-disk whitelist map directory.</description>
</property>

<property>
  <name>plugin.includes</name>
  <value>epile-whitelisturlfilter|urlfilter-(prefix|regex)|parse-(text|html)|index-basic|query-(basic|site|url)</value>
</property>

<property>
  <name>urlfilter.order</name>
  <value>org.apache.nutch.net.RegexURLFilter epile.crawl.plugin.WhitelistURLFilter</value>
</property>




[jira] Commented: (NUTCH-87) Efficient site-specific crawling for a large number of sites

Steve Loughran (Jira)
In reply to this post by Steve Loughran (Jira)
    [ http://issues.apache.org/jira/browse/NUTCH-87?page=comments#action_12323158 ]

Matt Kangas commented on NUTCH-87:
----------------------------------

Sample plugin.xml file for use with WhitelistURLFilter

<?xml version="1.0" encoding="UTF-8"?>
<plugin
   id="epile-whitelisturlfilter"
   name="Epile whitelist URL filter"
   version="1.0.0"
   provider-name="teamgigabyte.com">

   <extension-point
      id="org.apache.nutch.net.URLFilter"
      name="Nutch URL Filter"/>

   <runtime></runtime>

   <extension id="org.apache.nutch.net.urlfilter"
      name="Epile Whitelist URL Filter"
      point="org.apache.nutch.net.URLFilter">
             
      <implementation id="WhitelistURLFilter"
         class="epile.crawl.plugin.WhitelistURLFilter"/>                    
   </extension>
</plugin>
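
One caveat for anyone adapting this: other Nutch plugins list their jar inside <runtime> so the plugin classloader can find the classes. Depending on how you package the filter, you may need something like the following (the jar name here is only an assumption):

<runtime>
   <library name="epile-whitelisturlfilter.jar">
      <export name="*"/>
   </library>
</runtime>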



Re: Nutch-87 Setup

kangas
In reply to this post by Michael Ji
Hi Michael,

Only WhitelistURLFilter is a plugin class. WhitelistWriter is a  
utility for creating the on-disk hash used at fetch/inject time by  
WhitelistURLFilter. Sorry for the confusion. I will add a sample  
plugin.xml file to the ticket, which should help make things clearer.

Also, "epile.util.*" are our proprietary classes. LogLevel simply  
retrieves a value from a file other than nutch-site.xml. You can  
safely replace the references to epile.util.LogLevel with:

> import org.apache.nutch.util.LogFormatter;
> private static final Logger LOG =
>     LogFormatter.getLogger(WhitelistURLFilter.class.getName());

StringURL is another utility class, probably not of high value. It  
just applies regexes to URL strings. The only references to it that I  
see are:

> $ grep StringURL WhitelistURLFilter.java
> import epile.crawl.util.StringURL;
>     String hostname = StringURL.extractHostname(url);
>       String strippedURL = StringURL.removeHostname(url);
>         String domain = StringURL.extractDomainFromHostname(hostname);
>       if (StringURL.isCGI(url))

extractHostname() and removeHostname() can be replaced with calls to  
java.net.URL.getHost() and getPath(), respectively. The other two are  
simple to replicate, and can probably be commented out for basic use.
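
If it helps, JDK-only stand-ins might look like this (a sketch; isCGI() is guessed from its name, and extractDomainFromHostname() is omitted since its exact semantics are specific to my code):

import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical replacements for the epile.crawl.util.StringURL calls above.
final class UrlParts {
    static String extractHostname(String url) throws MalformedURLException {
        return new URL(url).getHost();  // replaces StringURL.extractHostname(url)
    }
    static String removeHostname(String url) throws MalformedURLException {
        return new URL(url).getPath();  // replaces StringURL.removeHostname(url)
    }
    static boolean isCGI(String url) {
        return url.indexOf('?') >= 0;   // guess: treat any query string as CGI
    }
}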

Finally, to use this "new" plugin, you need to:

a) make sure a suitable directory is created under "plugins",
including a plugin.xml and a jar with the WhitelistURLFilter class
(see the directory sketch after this list)

b) modify your nutch-site.xml to include the new filter:

> <property>
>   <name>epile.crawl.whitelist.enableUndirectedCrawl</name>
>   <value>false</value>
> </property>
>
> <property>
>   <name>urlfilter.whitelist.file</name>
>   <value>/var/epile/crawl/whitelist_map</value>
>   <description>Name of file containing the location of the on-disk
>   whitelist map directory.</description>
> </property>
>
> <property>
>   <name>plugin.includes</name>
>   <value>epile-whitelisturlfilter|urlfilter-(prefix|regex)|parse-(text|html)|index-basic|query-(basic|site|url)</value>
> </property>
>
> <property>
>   <name>urlfilter.order</name>
>   <value>org.apache.nutch.net.RegexURLFilter epile.crawl.plugin.WhitelistURLFilter</value>
> </property>

c) run WhitelistWriter before attempting to fetch, so the filter has  
some rules to work with.
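
For step (a), the layout would look something like this (the jar name is an assumption; use whatever your build produces):

$ mkdir -p plugins/epile-whitelisturlfilter        # under your Nutch install dir
$ cp plugin.xml plugins/epile-whitelisturlfilter/
$ cp epile-whitelisturlfilter.jar plugins/epile-whitelisturlfilter/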

I may have left out a crucial step or two here (0.5 wink ;), so feel  
free to ask if anything seems unclear. I'll go update the ticket now  
to clarify these points.

--Matt


--
Matt Kangas / [hidden email]



Re: Nutch-87 Setup

Michael Ji
hi Matt:

I applied and compiled your patch on Nutch 0.7 successfully.

However, I ran into a runtime problem when I tried to test
the patch manually by calling its class.

I edited bin/nutch and added the lines:
"
elif [ "$COMMAND" = WhitelistFilterTester ] ; then
  CLASS=epile.crawl.plugin.WhitelistURLFilter
"

But when I call it, it gives me this error:
"
Exception in thread "main"
java.lang.NoClassDefFoundError: epile/crawl/plugin/WhitelistURLFilter
"

I guess the classpath is not defined properly.

My environment setup is as follows:

1. In the Nutch build.xml, I added "<ant dir="epile" target="deploy"/>".

2. In nutch/src/plugin/, I created the directory "epile-basic/src/java",
then copied the unpacked NUTCH-87 sources (epile/crawl..) into that dir.

3. I created a plugin.xml in epile-basic/, the same as the
one you attached to the patch, and a new build.xml:
"
<?xml version="1.0"?>

<project name="WhitelistURLFilter" default="jar">

  <import file="../build-plugin.xml"/>

</project>

"

4. In Nutch, I can run "ant" successfully; in nutch/build/,
a new WhitelistURLFilter/ directory is created with
WhitelistURLFilter.class inside.

Did I miss something important?

thanks,

Michael Ji


Re: Nutch-87 Setup

kangas
Hi Michael,

Ordinarily there's no need to edit bin/nutch to run a specific class.  
If the class is in a JAR in <nutch-home>/lib, you can just say "nutch  
<full class name>". For example, the following two commands are  
equivalent:

$ nutch crawl
$ nutch org.apache.nutch.tools.CrawlTool

However, the situation is a little different for plugins. Ordinarily  
the classes for a plugin are placed in <nutch-home>/plugins/<plugin-
name>, not <nutch-home>/lib. To instantiate the plugin class, you  
must use *another* class which calls the appropriate plugin factory. For
URLFilter plugins, the factory class is  
org.apache.nutch.net.URLFilters. This class does not have a main()  
method, but there is a helper class to test filters,  
URLFilterChecker. You can run it as follows:

$ nutch org.apache.nutch.net.URLFilterChecker -allCombined < urls.txt
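
For reference, the checker reads URLs from stdin, one per line, and (if I recall the output format correctly) echoes each URL prefixed with "+" for accepted or "-" for rejected. So a quick smoke test looks like:

$ echo "http://www.example.com/" | nutch org.apache.nutch.net.URLFilterChecker -allCombined
+http://www.example.com/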

Hope that helps. Let me know if that doesn't work for you.

--Matt


--
Matt Kangas / [hidden email]



Re: Nutch-87 Setup

Michael Ji
hi Matt:

Thanks for your advice.

I can trigger URLFilterChecker successfully; however, I get the
following error complaining about an indexing filter. Could you
tell me where the problem might be?

"
050921 191015 impl: point=org.apache.nutch.net.URLFilter class=org.apache.nutch.net.RegexURLFilter

050921 191015 not including:
E:\programs\cygwin\home\fji\versionControl\nutch_V07_P87\nutch\build\plugins\WhitelistURLFilter

050921 191015 SEVERE org.apache.nutch.plugin.PluginRuntimeException:
extension point: org.apache.nutch.indexer.IndexingFilter does not exist.
Exception in thread "main" java.lang.ExceptionInInitializerError
        at org.apache.nutch.net.URLFilterChecker.checkAll(URLFilterChecker.java:93)
        at org.apache.nutch.net.URLFilterChecker.main(URLFilterChecker.java:126)
Caused by: java.lang.RuntimeException: org.apache.nutch.plugin.PluginRuntimeException:
extension point: org.apache.nutch.indexer.IndexingFilter does not exist.
        at org.apache.nutch.plugin.PluginRepository.getInstance(PluginRepository.java:147)
        at org.apache.nutch.net.URLFilters.<clinit>(URLFilters.java:40)
        ... 2 more
Caused by: org.apache.nutch.plugin.PluginRuntimeException:
extension point: org.apache.nutch.indexer.IndexingFilter does not exist.
        at org.apache.nutch.plugin.PluginRepository.installExtensions(PluginRepository.java:78)
        at org.apache.nutch.plugin.PluginRepository.<init>(PluginRepository.java:61)
        at org.apache.nutch.plugin.PluginRepository.getInstance(PluginRepository.java:144)
        ... 3 more
"

thanks,

Michael Ji



Re: Nutch-87 Setup

kangas
Michael, this looks like an error in your Nutch configuration, or  
possibly your CLASSPATH. I'd guess it's the former. Take a look at  
the following nutch-site.xml (or nutch-default) properties, and make  
sure they reference (a) the right place on disk, (b) plugins that  
actually exist:

- plugin.folders
- plugin.includes
- urlfilter.order
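
For example, a minimal plugin.folders entry in nutch-site.xml might look like this (the value below is the usual default; adjust it if your plugins live elsewhere):

<property>
  <name>plugin.folders</name>
  <value>plugins</value>
  <description>Directories to search for plugins. One of them must
  contain the epile-whitelisturlfilter directory with its plugin.xml.</description>
</property>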

If you're still stuck, email me privately and we'll try to work  
through this.

--Matt

--
Matt Kangas / [hidden email]



[jira] Commented: (NUTCH-87) Efficient site-specific crawling for a large number of sites

Steve Loughran (Jira)
In reply to this post by Steve Loughran (Jira)
    [ http://issues.apache.org/jira/browse/NUTCH-87?page=comments#action_12361660 ]

Neal Whitley commented on NUTCH-87:
-----------------------------------

Matt,

Is there a how-to or tutorial on getting this plugin up and running? I am running into problems (probably mine) with integrating it. Thanks!



[jira] Updated: (NUTCH-87) Efficient site-specific crawling for a large number of sites

Steve Loughran (Jira)
In reply to this post by Steve Loughran (Jira)
     [ http://issues.apache.org/jira/browse/NUTCH-87?page=all ]

Matt Kangas updated NUTCH-87:
-----------------------------

    Attachment: build.xml.patch
                urlfilter-whitelist.tar.gz

THIS REPLACES THE PREVIOUS TARBALL
SEE THE INCLUDED README.txt FOR USAGE GUIDELINES

Place both of these files into ~nutch/src/plugin, then:
- untar the tarball
- apply the patch to ~nutch/src/plugin/build.xml to permit urlfilter-whitelist to be built

Next, cd ~nutch and build ("ant").

A JUnit test is included. It will be run automatically by "ant test-plugins".

Then follow the instructions in ~nutch/src/plugin/urlfilter-whitelist/README.txt
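
In shell terms, the sequence is roughly (a sketch; README.txt has the authoritative steps):

$ cd ~nutch/src/plugin
$ tar xzf urlfilter-whitelist.tar.gz
$ patch build.xml < build.xml.patch
$ cd ~nutch
$ ant
$ ant test-plugins    # also runs the included JUnit test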



[jira] Commented: (NUTCH-87) Efficient site-specific crawling for a large number of sites

Steve Loughran (Jira)
In reply to this post by Steve Loughran (Jira)
    [ http://issues.apache.org/jira/browse/NUTCH-87?page=comments#action_12362584 ]

Matt Kangas commented on NUTCH-87:
----------------------------------

JIRA-87-whitelistfilter.tar.gz is OBSOLETE. Use the newer tarball + patch file instead.



[jira] Updated: (NUTCH-87) Efficient site-specific crawling for a large number of sites

Steve Loughran (Jira)
In reply to this post by Steve Loughran (Jira)
     [ http://issues.apache.org/jira/browse/NUTCH-87?page=all ]

Matt Kangas updated NUTCH-87:
-----------------------------

    Version: 0.7.2-dev
             0.8-dev



[jira] Updated: (NUTCH-87) Efficient site-specific crawling for a large number of sites

Steve Loughran (Jira)
In reply to this post by Steve Loughran (Jira)
     [ http://issues.apache.org/jira/browse/NUTCH-87?page=all ]

Matt Kangas updated NUTCH-87:
-----------------------------

    Attachment: build.xml.patch-0.8

The previous patch file is valid for 0.7. Here is one that works for 0.8-dev (trunk).

(It's three separate one-line additions, to include the plugin in the "deploy", "test", and "clean" targets.)
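
For anyone applying the additions by hand, they look roughly like this, following the pattern of the existing plugin entries (exact placement is in the patch itself):

<!-- in the "deploy" target -->
<ant dir="urlfilter-whitelist" target="deploy"/>

<!-- in the "test" target -->
<ant dir="urlfilter-whitelist" target="test"/>

<!-- in the "clean" target -->
<ant dir="urlfilter-whitelist" target="clean"/>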
