[jira] Created: (NUTCH-100) New plugin urlfilter-db


Tim Allison (Jira)
New plugin urlfilter-db
-----------------------

         Key: NUTCH-100
         URL: http://issues.apache.org/jira/browse/NUTCH-100
     Project: Nutch
        Type: New Feature
  Components: fetcher  
    Versions: 0.8-dev    
 Environment: MapRed
    Reporter: Gal Nitzan
    Priority: Trivial


Hi,

I have written (not much) a new plugin based on the URLFilter interface: urlfilter-db.

The purpose of this plugin is to filter by domain, i.e. I would like to crawl the whole web but fetch only from certain domains.

The plugin uses a caching system (SwarmCache, which is easier to deploy than JCS) backed by a database.

For each url
   filter is called
end for

filter
  get the domain name from url
  call cache.get domain
  if in cache return the url
  if not in cache try the database
  if in database cache it and return the url
  otherwise return null
end filter


The plugin reads the cache size, JDBC driver, connection string, table name and domain field from nutch-site.xml.
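
As an illustration only, the lookup order in the pseudocode above could look roughly like this in Java. This is a minimal sketch, not the plugin's actual code: `DomainStore` and `DbUrlFilterSketch` are hypothetical names, the database is reduced to a one-method interface, an unbounded in-memory set stands in for SwarmCache, and the URL's host is assumed to be the "domain".

```java
import java.net.URI;
import java.util.Locale;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical stand-in for the database table of allowed domains. */
interface DomainStore {
    boolean contains(String domain);
}

/** Sketch of the lookup order: cache first, then the database. */
class DbUrlFilterSketch {
    // Unbounded set used here in place of a size-limited cache.
    private final Set<String> cache = ConcurrentHashMap.newKeySet();
    private final DomainStore store;

    DbUrlFilterSketch(DomainStore store) {
        this.store = store;
    }

    /** Returns the URL if its host is allowed, otherwise null. */
    String filter(String url) {
        String host;
        try {
            host = URI.create(url).getHost();
        } catch (IllegalArgumentException e) {
            return null; // unparsable URL: reject
        }
        if (host == null) {
            return null; // no host component: reject
        }
        host = host.toLowerCase(Locale.ROOT);
        if (cache.contains(host)) {
            return url; // cache hit, no database access
        }
        if (store.contains(host)) {
            cache.add(host); // remember the domain for later URLs
            return url;
        }
        return null; // domain is not in the list
    }
}
```

In this sketch the store is consulted at most once per accepted domain; rejected domains are looked up again on every call, which a real implementation might also want to cache.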


--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira


[jira] Updated: (NUTCH-100) New plugin urlfilter-db

Tim Allison (Jira)
     [ http://issues.apache.org/jira/browse/NUTCH-100?page=all ]

Gal Nitzan updated NUTCH-100:
-----------------------------

    Attachment: urlfilter-db.tar.gz

This is the plugin. Extract it and read the README in the myplugin folder.


[jira] Updated: (NUTCH-100) New plugin urlfilter-db

Tim Allison (Jira)
In reply to this post by Tim Allison (Jira)
     [ http://issues.apache.org/jira/browse/NUTCH-100?page=all ]

Gal Nitzan updated NUTCH-100:
-----------------------------

    Attachment: urlfilter-db.tar.gz
                AddedDbURLFilter.patch

Fixed an issue with SwarmCache (it is no longer loaded as a daemon).
Cleaned up the code and added comments.
Added some logging.


[jira] Updated: (NUTCH-100) New plugin urlfilter-db

Tim Allison (Jira)
In reply to this post by Tim Allison (Jira)
     [ http://issues.apache.org/jira/browse/NUTCH-100?page=all ]

Gal Nitzan updated NUTCH-100:
-----------------------------

           type: Improvement  (was: New Feature)
    Description: unchanged from the original report, except that "(not much)" was dropped from the first sentence
    Environment: All Nutch versions  (was: MapRed)

Fixed some issues.
Cleaned up the code.
Added a patch against Subversion.


Re: [jira] Updated: (NUTCH-100) New plugin urlfilter-db

Michael Ji
Hi,

How does performance hold up if the domain list
reaches 10,000 entries?

Michael Ji


Re: [jira] Updated: (NUTCH-100) New plugin urlfilter-db

Gal Nitzan
Hi Michael,

At the moment I have about 3000 domains in my db. I haven't timed the
performance, but even 100k domains shouldn't have an impact, since each
domain is fetched from the database into the cache only once. There
should be a small performance hit above 100k (depending on the number of
elements defined in the XML file).

After a few teething problems, the plugin works nicely and I do not
notice any impact.

Regards,

Gal



Re: [jira] Updated: (NUTCH-100) New plugin urlfilter-db

Otis Gospodnetic-2-2
Hi Gal,

I'm curious about the memory consumption of the cache and the speed of
retrieval of an item from the cache, when the cache has 100k domains in
it.

Thanks,
Otis



Re: [jira] Updated: (NUTCH-100) New plugin urlfilter-db

Andrzej Białecki-2
[hidden email] wrote:
> Hi Gal,
>
> I'm curious about the memory consumption of the cache and the speed of
> retrieval of an item from the cache, when the cache has 100k domains in
> it.

Slightly off-topic, but I hope this is relevant to the original reason
for creating this plugin...

There is a BSD-licensed library, based on finite automata, that
implements a large subset of regexps. It is reported to be scalable and
very fast (the benchmarks are certainly impressive):

        http://www.brics.dk/~amoeller/automaton/

I suggest running some tests with 100k regexps to see whether it copes.


--
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: [jira] Updated: (NUTCH-100) New plugin urlfilter-db

Gal Nitzan
In reply to this post by Otis Gospodnetic-2-2
Hi Otis,

I have only a few thousand URLs in my db at the moment. However, for
100K it should be about 600-800KB. I do not cache the URL itself, only a
hash string, so the next time a URL is looked up in the cache, it is
allowed if its hash exists.
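
To make the idea of caching only a hash string concrete, here is a small hedged sketch. The class name `HashedDomainCache`, the choice of MD5 and the hex encoding are all assumptions made for illustration; the message does not say which hash the plugin uses.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Set;

/** Sketch: remember allowed domains by a fixed-length hash string only. */
class HashedDomainCache {
    private final Set<String> hashes = new HashSet<>();

    /** Hex-encoded MD5 of the domain (MD5 is an assumed choice). */
    static String hash(String domain) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(domain.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    /** Records the domain's hash; the domain string itself is not kept. */
    void allow(String domain) {
        hashes.add(hash(domain));
    }

    /** A domain is allowed if its hash has been recorded. */
    boolean isAllowed(String domain) {
        return hashes.contains(hash(domain));
    }
}
```

Each entry then has the same size regardless of how long the original string was, at the cost of a theoretical chance that two different strings share a hash.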

Regards,

Gal


Re: [jira] Updated: (NUTCH-100) New plugin urlfilter-db

Gal Nitzan
In reply to this post by Andrzej Białecki-2
Hi,

Well, the reason for this plugin is that I wish to crawl many sites, but
they must all be in my list. If it were implemented with regular
expressions, the filter would still have to loop over 100K expressions
for each URL to find a match, right?

Regards,

Gal


Re: [jira] Updated: (NUTCH-100) New plugin urlfilter-db

Andrzej Białecki-2
Gal Nitzan wrote:
> Hi,
>
> Well, the reason for this plugin is that I wish to crawl many sites, but
> they must all be in my list. If it were implemented with regular
> expressions, the filter would still have to loop over 100K expressions
> for each URL to find a match, right?

No, that's the whole point - using the library I mentioned you can build
a _single_ finite state automaton from all expressions. No looping, just
traversing a tree (or whatever equivalent structure they use).

100k regexps is still a lot, so I'm not totally sure it would be much
faster, but it is perhaps worth checking.
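
The "single structure, no looping" idea can be illustrated without the automaton library. The sketch below is a hand-written trie over reversed host labels, not the brics API, and it handles plain domain names only rather than regular expressions; `DomainTrie` is a hypothetical name. It shows that the cost of a lookup depends on the number of labels in the host, not on how many domains were added.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Sketch: one trie built from all allowed domains, keyed by reversed host labels. */
class DomainTrie {
    private static final class Node {
        final Map<String, Node> children = new HashMap<>();
        boolean terminal; // true if a listed domain ends at this node
    }

    private final Node root = new Node();

    /** Adds a domain such as "example.org" to the trie. */
    void add(String domain) {
        Node node = root;
        String[] labels = domain.toLowerCase(Locale.ROOT).split("\\.");
        for (int i = labels.length - 1; i >= 0; i--) {
            node = node.children.computeIfAbsent(labels[i], k -> new Node());
        }
        node.terminal = true;
    }

    /** True if host equals a listed domain or is a subdomain of one. */
    boolean matches(String host) {
        Node node = root;
        String[] labels = host.toLowerCase(Locale.ROOT).split("\\.");
        for (int i = labels.length - 1; i >= 0; i--) {
            node = node.children.get(labels[i]);
            if (node == null) {
                return false; // left the trie: no listed domain matches
            }
            if (node.terminal) {
                return true; // reached the end of a listed domain
            }
        }
        return false; // host is shorter than any listed domain on this path
    }
}
```

A host is walked from its top-level label inwards, so a single pass over at most a handful of labels replaces a loop over every entry in the list.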

--
Best regards,
Andrzej Bialecki     <><

Re: [jira] Updated: (NUTCH-100) New plugin urlfilter-db

Gal Nitzan
Hi Andrzej,

Yes, it seems like a good option. However, it is GPL, and I noticed in
one of the posts that this license is no good for apache.org :).

Regards,

Gal



Re: [jira] Updated: (NUTCH-100) New plugin urlfilter-db

Andrzej Białecki-2
Gal Nitzan wrote:
> Hi Andrzej,
>
> Yes, it seems like a good option. However, it is GPL, and I noticed in
> one of the posts that this license is no good for apache.org :).

If you are referring to the brics automaton library, it's BSD-licensed.  I
mentioned in one of the posts that the Innovation HTTPClient is LGPL,
and hence not acceptable for apache.org.


--
Best regards,
Andrzej Bialecki     <><

Re: [jira] Updated: (NUTCH-100) New plugin urlfilter-db

Doug Cutting-2
In reply to this post by Andrzej Białecki-2
Andrzej Bialecki wrote:
> 100k regexps is still a lot, so I'm not totally sure it would be much
> faster, but it is perhaps worth checking.

I have worked with this type of technology before (minimized,
determinized FSAs, constructed from large sets of strings & expressions)
and it should be very fast to perform lookups, even in large, complex
FSAs.  Construction of the FSA can be time-consuming and should probably
be done offline, not at fetcher startup time, so that it is only
performed once for a number of fetcher runs.

Doug

Re: [jira] Updated: (NUTCH-100) New plugin urlfilter-db

Andrzej Białecki-2
Doug Cutting wrote:

> Andrzej Bialecki wrote:
>
>> 100k regexps is still a lot, so I'm not totally sure it would be much
>> faster, but it is perhaps worth checking.
>
>
> I have worked with this type of technology before (minimized,
> determinized FSAs, constructed from large sets of strings & expressions)
> and it should be very fast to perform lookups, even in large, complex
> FSAs.  Construction of the FSA can be time-consuming and should probably
> be done offline, not at fetcher startup time, so that it is only
> performed once for a number of fetcher runs.

Guess what... this library supports (de)serialization of automata, so
they can be compiled once, and then just stored/loaded.
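
To make the compile-once, load-many idea concrete, here is a minimal sketch in plain Java. It is not the automaton library's API: the class HostAutomatonSketch and its methods are hypothetical, the automaton is just a trie over literal host names (no regexps, no minimization), and standard Java serialization stands in for the library's own (de)serialization.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;

/** Toy deterministic automaton (a trie) over host names; hypothetical, for illustration only. */
final class HostAutomatonSketch {

    /** One automaton state: outgoing transitions plus an accept flag. */
    static final class State implements Serializable {
        private static final long serialVersionUID = 1L;
        final Map<Character, State> next = new HashMap<>();
        boolean accept;
    }

    /** Slow step: build the automaton from the full host list (meant to run offline). */
    static State compile(Collection<String> hosts) {
        State start = new State();
        for (String host : hosts) {
            State s = start;
            for (char c : host.toCharArray()) {
                s = s.next.computeIfAbsent(c, k -> new State());
            }
            s.accept = true;
        }
        return start;
    }

    /** Fast step: one transition per character; cost depends on the host length, not the list size. */
    static boolean accepts(State start, String host) {
        State s = start;
        for (int i = 0; i < host.length() && s != null; i++) {
            s = s.next.get(host.charAt(i));
        }
        return s != null && s.accept;
    }

    /** Serializes the compiled automaton so that later runs can skip compile(). */
    static byte[] store(State start) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buf)) {
            out.writeObject(start);
        }
        return buf.toByteArray();
    }

    /** Restores an automaton previously written by store(). */
    static State load(byte[] bytes) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return (State) in.readObject();
        }
    }
}
```

In a real setup the bytes would presumably go to a file that each fetcher run loads at startup, so only the offline job pays the construction cost.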



[jira] Commented: (NUTCH-100) New plugin urlfilter-db

Tim Allison (Jira)
In reply to this post by Tim Allison (Jira)
    [ http://issues.apache.org/jira/browse/NUTCH-100?page=comments#action_12367706 ]

Fuad Efendi commented on NUTCH-100:
-----------------------------------

Please avoid this:

  public void finalize() throws Throwable {
    cleanup();
  }

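- If cleanup() throws, the GC ignores the exception and finalization of this object terminates, so the cleanup may silently never complete. There is also no guarantee about when, or whether, finalize() runs at all.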
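
One possible alternative is to release resources explicitly instead of relying on the garbage collector. The following is only a sketch: the class name DbFilterResource and the cleanupCalls counter are hypothetical, and cleanup() is a stand-in because the plugin's real cleanup code is not shown in this thread.

```java
/** Hypothetical holder for the plugin's cache and JDBC connection. */
final class DbFilterResource implements AutoCloseable {
    private boolean closed = false;
    int cleanupCalls = 0; // only here so the sketch can be checked

    /** Stand-in for the plugin's cleanup(); the real one would close the cache and connection. */
    private void cleanup() {
        cleanupCalls++;
    }

    /** Explicit, idempotent release; runs at a known point instead of at GC time. */
    @Override
    public void close() {
        if (closed) {
            return;
        }
        closed = true;
        cleanup();
    }

    boolean isClosed() {
        return closed;
    }
}
```

A caller on Java 7 or later could write `try (DbFilterResource r = new DbFilterResource()) { ... }`, which calls close() even when the body throws; on older JVMs a try/finally block gives the same effect.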


> New plugin urlfilter-db
> -----------------------
>
>          Key: NUTCH-100
>          URL: http://issues.apache.org/jira/browse/NUTCH-100
>      Project: Nutch
>         Type: Improvement
>   Components: fetcher
>     Versions: 0.8-dev
>  Environment: All Nutch versions
>     Reporter: Gal Nitzan
>     Priority: Trivial
>  Attachments: AddedDbURLFilter.patch, urlfilter-db.tar.gz, urlfilter-db.tar.gz
>
> Hi,
> I have written a new plugin, based on the URLFilter interface: urlfilter-db .
> The purpose of this plugin is to filter domains, i.e. I would like to crawl the world but to fetch only certain domains.
> The plugin uses a caching system (SwarmCache, easier to deploy than JCS) and on the back-end a database.
> For each url
>    filter is called
> end for
> filter
>  get the domain name from url
>   call cache.get domain
>   if not in cache try the database
>   if in database cache it and return it
>   return null
> end filter
> The plugin reads the cache size, jdbc driver, connection string, table to use and domain field from nutch-site.xml
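
The filter pseudocode quoted above can be sketched roughly as follows. This is not the attached plugin: DomainFilterSketch is a hypothetical class, an access-ordered LinkedHashMap stands in for SwarmCache, an in-memory Set stands in for the database table, and the URL's host is assumed to be the "domain". As in the pseudocode, only positive answers are cached, and null means the URL is rejected.

```java
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

/** Hypothetical cache-then-database domain filter; illustration only. */
final class DomainFilterSketch {
    private final Set<String> database;        // stand-in for the JDBC-backed table
    private final Map<String, Boolean> cache;  // small LRU stand-in for SwarmCache
    int databaseLookups = 0;                   // only here so the sketch can be checked

    DomainFilterSketch(Set<String> database, final int cacheSize) {
        this.database = database;
        // Access-ordered map that evicts the least recently used entry when full.
        this.cache = new LinkedHashMap<String, Boolean>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                return size() > cacheSize;
            }
        };
    }

    /** Returns the URL unchanged if its domain is allowed, or null to reject it. */
    String filter(String url) {
        String domain;
        try {
            domain = URI.create(url).getHost();
        } catch (IllegalArgumentException e) {
            return null; // unparsable URL
        }
        if (domain == null) {
            return null;
        }
        if (cache.get(domain) != null) {
            return url; // cache hit, no database access
        }
        databaseLookups++;
        if (database.contains(domain)) {
            cache.put(domain, Boolean.TRUE); // cache positives only, as in the pseudocode
            return url;
        }
        return null;
    }
}
```

One consequence of caching only positives is that every URL from a rejected domain reaches the database again, which may matter when most of the crawl frontier is outside the allowed set.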


[jira] Commented: (NUTCH-100) New plugin urlfilter-db

Tim Allison (Jira)
In reply to this post by Tim Allison (Jira)
    [ http://issues.apache.org/jira/browse/NUTCH-100?page=comments#action_12367721 ]

Fuad Efendi commented on NUTCH-100:
-----------------------------------

Please add the port number:
if (u.getPort()!=-1 || u.getPort()!=80) {
   ret = ret + ":" + u.getPort();
}



[jira] Commented: (NUTCH-100) New plugin urlfilter-db

Tim Allison (Jira)
In reply to this post by Tim Allison (Jira)
    [ http://issues.apache.org/jira/browse/NUTCH-100?page=comments#action_12367731 ]

Fuad Efendi commented on NUTCH-100:
-----------------------------------

Sorry, it should be [&&] instead of [||] in my previous comment.
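
For reference, a self-contained sketch of the port check with && in place of ||. With ||, the condition is always true, because the port cannot be both -1 and 80 at once, so ":-1" would be appended to URLs that carry no explicit port. The class and method names here are hypothetical, and u is assumed to be a java.net.URL.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;

/** Hypothetical helper showing the corrected port check. */
final class HostPortSketch {

    /** Returns the host, plus ":port" only when an explicit non-80 port is present. */
    static String hostWithPort(URL u) {
        String ret = u.getHost();
        int port = u.getPort(); // -1 means the URL carries no explicit port
        if (port != -1 && port != 80) {
            ret = ret + ":" + port;
        }
        return ret;
    }

    /** Convenience overload for strings; wraps the checked parse exception. */
    static String hostWithPort(String url) {
        try {
            return hostWithPort(URI.create(url).toURL());
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(url, e);
        }
    }
}
```

Note that 80 is only the default for http; comparing against u.getDefaultPort() instead of the literal 80 might be a more general check for other protocols.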


