Urlfilter Patch


Urlfilter Patch

Rod Taylor-2
Add a few more extensions that I commonly see and that cannot be parsed
(as far as I am aware): ZIP, mso, jar, bz2, XLS, pps, PPS, dot, etc.

Add additional lines (commented out by default) for quickly rejecting
URLs for extended content types (doc, png, pdf, rtf, etc.), for people
who do not want anything but HTML, or items whose URLs can lead us to
the HTML.

--
Rod Taylor <[hidden email]>

Attachment: urlfilter.patch (1K)

Re: Urlfilter Patch

Doug Cutting-2
Rod Taylor wrote:
> Add a few more extensions which I commonly see and cannot be parsed
> (that I am aware of). ZIP, mso, jar, bz2, XLS, pps, PPS, dot, etc.

[ ... ]

>  # skip image and other suffixes we can't yet parse
> --\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png)$
> +-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|ps|wmf|zip|ZIP|ppt|mpg|xls|XLS||bz2|gz|mso|jar|rpm|tgz|Z|mov|MOV|exe|dot|pps|PPS)$

Is the '||' intentional, or a typo?  Do you mean to prohibit files
ending with just '.'?

Doug

Re: Urlfilter Patch

Rod Taylor-2
On Mon, 2005-11-28 at 11:44 -0800, Doug Cutting wrote:

> Rod Taylor wrote:
> > Add a few more extensions which I commonly see and cannot be parsed
> > (that I am aware of). ZIP, mso, jar, bz2, XLS, pps, PPS, dot, etc.
>
> [ ... ]
>
> >  # skip image and other suffixes we can't yet parse
> > --\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png)$
> > +-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|ps|wmf|zip|ZIP|ppt|mpg|xls|XLS||bz2|gz|mso|jar|rpm|tgz|Z|mov|MOV|exe|dot|pps|PPS)$
>
> Is the '||' intentional, or a typo?  Do you mean to prohibit files
> ending with just '.'?

It is just a typo.

I rearranged some of the names before submitting and must have done it
then.



Re: Urlfilter Patch

kkrugler
>On Mon, 2005-11-28 at 11:44 -0800, Doug Cutting wrote:
>>  Rod Taylor wrote:
>>  > Add a few more extensions which I commonly see and cannot be parsed
>>  > (that I am aware of). ZIP, mso, jar, bz2, XLS, pps, PPS, dot, etc.
>>
>>  [ ... ]
>>
>>  >  # skip image and other suffixes we can't yet parse
>>  > --\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png)$
>>  > +-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|ps|wmf|zip|ZIP|ppt|mpg|xls|XLS||bz2|gz|mso|jar|rpm|tgz|Z|mov|MOV|exe|dot|pps|PPS)$

For what it's worth, below is the filter list we're using for doing
an html-centric crawl (no word docs, for example). Using the (?i)
means we don't need to have upper & lower-case versions of the
suffixes.

-- Ken

-(?i)\.(ai|asf|au|avi|bz2|bin|bmp|c|cgi|class|css|dmg|doc|dot|dvi|eps|exe|gif|gz|h|hqx|ico|iso|jar|java|jnlp|jpeg|jpg|js|jsp|lha|md5|mov|mp3|mp4|mpg|msi|ogg|pdf|php|pl|png|pps|ppt|ps|psd|py|ram|rdf|rm|rpm|rss|rtf|sit|swf|tar|tbz|tbz2|tgz|tif|wav|wmf|wmv|xhtml|xls|xml|z|zip)\)?$
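As a side note, the (?i) behavior is easy to sanity-check outside Nutch. Below is a small Python sketch (the suffix list is abbreviated, not the full one above; the leading '-' in regex-urlfilter.txt marks a "reject" rule and is not part of the regex itself):

```python
import re

# Abbreviated version of the exclusion rule above. The trailing \)?
# tolerates URLs that end with a stray close-paren.
SKIP = re.compile(r"(?i)\.(bz2|doc|exe|gif|gz|jpg|mov|pdf|png|ppt|xls|zip)\)?$")

def rejected(url):
    return SKIP.search(url) is not None

print(rejected("http://example.com/report.PDF"))  # True - (?i) covers upper case
print(rejected("http://example.com/a.zip)"))      # True - trailing ')' tolerated
print(rejected("http://example.com/index.html"))  # False - html gets fetched
```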


--
Ken Krugler
Krugle, Inc.
+1 530-470-9200

Re: Urlfilter Patch

Doug Cutting-2
Ken Krugler wrote:
> For what it's worth, below is the filter list we're using for doing an
> html-centric crawl (no word docs, for example). Using the (?i) means we
> don't need to have upper & lower-case versions of the suffixes.
>
> -(?i)\.(ai|asf|au|avi|bz2|bin|bmp|c|cgi|class|css|dmg|doc|dot|dvi|eps|exe|gif|gz|h|hqx|ico|iso|jar|java|jnlp|jpeg|jpg|js|jsp|lha|md5|mov|mp3|mp4|mpg|msi|ogg|pdf|php|pl|png|pps|ppt|ps|psd|py|ram|rdf|rm|rpm|rss|rtf|sit|swf|tar|tbz|tbz2|tgz|tif|wav|wmf|wmv|xhtml|xls|xml|z|zip)\)?$

This looks like a more complete suffix list.

Should we use this as the default?  By default only html and text
parsers are enabled, so perhaps that's all we should accept.

Why do you exclude .php urls?  These are simply dynamic pages, no?
Similarly, .jsp and .py are frequently suffixes that return html.  Are
there other suffixes we should remove from this list before we make it
the default exclusion list?

Doug

Re: Urlfilter Patch

Howie Wang
.pl files are often just Perl CGI scripts. And .xhtml seems like it would
be parsable by the default HTML parser.

Howie




Re: Urlfilter Patch

Rod Taylor-2
On Thu, 2005-12-01 at 18:53 +0000, Howie Wang wrote:
> And .xhtml seem like they
> would be parsable by the default HTML parser.

Ditto for .xml. It is a valid (though seldom used) xhtml extension.



Re: Urlfilter Patch

Jérôme Charron
Suggestion:
For consistency, and for ease of Nutch management, why not filter the
extensions based on the activated plugins?
By looking at the mime-types defined in the parse-plugins.xml file and the
activated plugins, we know which content-types will be parsed.
So, by getting the file extensions associated with each content-type, we can
build a list of file extensions to include (all other ones will be excluded)
in the fetch process.
No?
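A rough sketch of the idea, in Python. The mapping below is illustrative stand-in data, not the real contents of parse-plugins.xml or the mime-type registry:

```python
# Hypothetical sketch: derive the set of allowed suffixes from the
# content-types that activated parse plugins can handle.
# MIME_TO_EXTENSIONS stands in for the mime-type registry; the argument
# stands in for the content-types of the activated plugins, as read
# from parse-plugins.xml.
MIME_TO_EXTENSIONS = {
    "text/html": ["html", "htm"],
    "text/plain": ["txt"],
    "application/pdf": ["pdf"],
    "application/msword": ["doc", "dot"],
}

def allowed_extensions(activated_mime_types):
    exts = set()
    for mime in activated_mime_types:
        exts.update(MIME_TO_EXTENSIONS.get(mime, ()))
    return sorted(exts)

# With only the default html and text parsers activated:
print(allowed_extensions(["text/html", "text/plain"]))  # ['htm', 'html', 'txt']
```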

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/

Re: Urlfilter Patch

chrismattmann
Jerome,

 I think that this is a great idea, and it ensures that there isn't
replication of so-called "management information" across the system. It
could easily be implemented as a utility method, because we already have
utility Java classes that represent the ParsePluginList, from which you
could get the mime types. Additionally, we could create a utility method
that searches the extension-point list for parsing plugins and returns
whether or not they are activated. Using this information, I believe that
the URL filtering would be a snap.

+1

Cheers,
  Chris



On 12/1/05 12:11 PM, "Jérôme Charron" <[hidden email]> wrote:
[...]

______________________________________________
Chris A. Mattmann
[hidden email]
Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group

_________________________________________________
Jet Propulsion Laboratory            Pasadena, CA
Office: 171-266B                        Mailstop:  171-246
_______________________________________________________

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.



Re: Urlfilter Patch

Piotr Kosiorowski
In reply to this post by Jérôme Charron
Jérôme Charron wrote:
[...]
> build a list of file extensions to include (other ones will be excluded) in
> the fetch process.
[...]
I would not like to exclude all others - as, for example, many extensions
are valid for HTML, especially dynamically generated pages (jsp, asp, and
cgi, just to name the easy ones, plus a lot of custom ones). But the idea
of automatically allowing extensions for which plugins are enabled is good
in my opinion.
Anyway, I will try to find my own list of forbidden extensions, which I
prepared based on 80 million URLs - I generated the list of the most common
suffixes and went through it manually. I will try to find it over the
weekend so we can combine it with the list discussed in this thread.
P.



Re: Urlfilter Patch

Doug Cutting-2
In reply to this post by Jérôme Charron
Jérôme Charron wrote:
> For consistency purpose, and easy of nutch management, why not filtering the
> extensions based on the activated plugins?
> By looking at the mime-types defined in the parse-plugins.xml file and the
> activated plugins, we know which content-types will be parsed.
> So, by getting the file extensions associated to each content-type, we can
> build a list of file extensions to include (other ones will be excluded) in
> the fetch process.
> No?

What about a site that develops a content system that has urls that end
in .foo, which we would exclude, even though they return html?

Doug

Re: Urlfilter Patch

chrismattmann
Hi Doug,


On 12/1/05 1:11 PM, "Doug Cutting" <[hidden email]> wrote:

> Jérôme Charron wrote:
[...]
>
> What about a site that develops a content system that has urls that end
> in .foo, which we would exclude, even though they return html?
>
> Doug

  In principle, the mimeType system should give us some guidance on
determining the appropriate mimeType for the content, regardless of whether
it ends in .foo, .bar or the like. I'm not sure if the mime type registry is
there yet, but I know that Jerome was working on a major update that would
help in recognizing these types of situations. Of course, efficiency comes
into play as well, in terms of not slowing down the fetch/parse, but it
would be nice to have a general solution that made use of the information
available in parse-plugins.xml to determine the appropriate set of allowed
extensions in a URLFilter, if possible. It may be a pipe dream, but I'd say
it's worth exploring...

Cheers,
  Chris






Re: Urlfilter Patch

Doug Cutting-2
Chris Mattmann wrote:
>   In principle, the mimeType system should give us some guidance on
> determining the appropriate mimeType for the content, regardless of whether
> it ends in .foo, .bar or the like.

Right, but the URL filters run long before we know the mime type, in
order to try to keep us from fetching lots of stuff we can't process.
The mime type is not known until we've fetched it.

Doug

Re: Urlfilter Patch

Jérôme Charron
> Right, but the URL filters run long before we know the mime type, in
> order to try to keep us from fetching lots of stuff we can't process.
> The mime type is not known until we've fetched it.

Yes, the fetcher can't rely on the document mime-type.
The only thing we can use for filtering is the document's URL.
So, another alternative could be to exclude only the file extensions that
are registered in the mime-type repository (some well-known file
extensions) but for which no parser is activated, and to accept all other
ones.
So that the .foo files will be fetched...

Jérôme

Re: Urlfilter Patch

kangas
In reply to this post by Doug Cutting-2
The latter is not strictly true. Nutch could issue an HTTP HEAD  
before the HTTP GET, and determine the mime-type before actually  
grabbing the content.

It's not how Nutch works now, but this might be more useful than a  
super-detailed set of regexes...

kangas@kangas-dev:~$ telnet localhost 80
Trying 127.0.0.1...
Connected to localhost.localdomain.
Escape character is '^]'.
HEAD / HTTP/1.0

HTTP/1.1 200 OK
Date: Thu, 01 Dec 2005 21:25:38 GMT
Server: Apache/2.0
Connection: close
Content-Type: text/html; charset=UTF-8

Connection closed by foreign host
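The same probe can be scripted. A minimal Python sketch follows; the helper names are illustrative, and a real fetcher would also need politeness delays, redirect handling, and error recovery:

```python
import http.client

def head_content_type(host, path="/", timeout=10):
    # Issue a HEAD request and return the Content-Type header,
    # without downloading the response body.
    conn = http.client.HTTPConnection(host, timeout=timeout)
    try:
        conn.request("HEAD", path)
        return conn.getresponse().getheader("Content-Type")
    finally:
        conn.close()

def is_parseable(content_type, accepted=("text/html", "text/plain")):
    # Content-Type may carry parameters, e.g. "text/html; charset=UTF-8",
    # so strip everything after the first ';' before comparing.
    if content_type is None:
        return False
    mime = content_type.split(";")[0].strip().lower()
    return mime in accepted

print(is_parseable("text/html; charset=UTF-8"))  # True
print(is_parseable("application/zip"))           # False
```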



On Dec 1, 2005, at 4:21 PM, Doug Cutting wrote:

> [...]

--
Matt Kangas / [hidden email]



Re: Urlfilter Patch

Doug Cutting-2
Matt Kangas wrote:
> The latter is not strictly true. Nutch could issue an HTTP HEAD  before
> the HTTP GET, and determine the mime-type before actually  grabbing the
> content.
>
> It's not how Nutch works now, but this might be more useful than a  
> super-detailed set of regexes...

This could be a useful addition, but it could not replace url-based
filters.  A HEAD request must still be polite, so this could
substantially slow fetching, as it would incur more delays.  Also, for
most dynamic pages, a HEAD is as expensive for the server as a GET, so
this would cause more load on servers.

Doug

Re: Urlfilter Patch

kkrugler
In reply to this post by Jérôme Charron
>Suggestion:
>For consistency purpose, and easy of nutch management, why not filtering the
>extensions based on the activated plugins?
>By looking at the mime-types defined in the parse-plugins.xml file and the
>activated plugins, we know which content-types will be parsed.
>So, by getting the file extensions associated to each content-type, we can
>build a list of file extensions to include (other ones will be excluded) in
>the fetch process.

I'd asked a Nutch consultant this exact same question a few months ago.

It does seem odd that there's an implicit dependency between the file
suffixes found in regex-urlfilter.txt and the enabled plug-ins found
in nutch-default.xml and nutch-site.xml. What's the point of
downloading a 100MB .bz2 file if there's nobody available to handle
it?

It's also odd that there's a nutch-site.xml, but no equivalent for
regex-urlfilter.txt.

There are the cases of some suffixes (like .php) that can return any
kind of mime-type content, and other suffixes (like .xml) that can
mean any number of things. So I think you'd still want
regex-urlfilter.txt files (both a default and a site version) that
provide explicit additions/deletions to the list generated from the
installed and enabled parse-plugins.

-- Ken

Re: Urlfilter Patch

kkrugler
In reply to this post by Rod Taylor-2
Agreed - looks like this list is too aggressive. A better one would be:

-(?i)\.(ai|asf|au|avi|bz2|bin|bmp|c|class|css|dmg|doc|dot|dvi|eps|exe|gif|gz|h|hqx|ico|iso|jar|java|jnlp|jpeg|jpg|js|lha|md5|mov|mp3|mp4|mpg|msi|ogg|pdf|png|pps|ppt|ps|psd|ram|rdf|rm|rpm|rss|rtf|sit|swf|tar|tbz|tbz2|tgz|tif|wav|wmf|wmv|xls|z|zip)\)?$

This removes xhtml, xml, php, jsp, py, pl, and cgi.

We've seen php/jsp/py/pl/cgi in our error logs as unparsable, but it looks
like most of those cases are when the server is misconfigured and winds up
returning the source code, as opposed to the result of executing the code.

-- Ken

>On Thu, 2005-12-01 at 18:53 +0000, Howie Wang wrote:
>>  .And .xhtml seem like they
>>  would be parsable by the default HTML parser.
>
>Ditto for .xml. It is a valid (though seldom used) xhtml extension.



RE: Urlfilter Patch

chrismattmann
In reply to this post by Doug Cutting-2
Hi Doug,

>
> Chris Mattmann wrote:
> >   In principle, the mimeType system should give us some guidance on
> > determining the appropriate mimeType for the content, regardless of
> whether
> > it ends in .foo, .bar or the like.
>
> Right, but the URL filters run long before we know the mime type, in
> order to try to keep us from fetching lots of stuff we can't process.
> The mime type is not known until we've fetched it.

Duh, you're right. Sorry about that.

Matt Kangas wrote:
> The latter is not strictly true. Nutch could issue an HTTP HEAD  
> before the HTTP GET, and determine the mime-type before actually  
> grabbing the content.
>
> It's not how Nutch works now, but this might be more useful than a
> super-detailed set of regexes...


I liked Matt's idea of the HEAD request though. I wonder if some benchmarks
on its performance would be useful, because in some cases (such as focused
crawling, or "non-whole-internet" crawling such as intranets), it would
seem that the performance penalty of performing the HEAD to get the
content-type would be acceptable, and worth the cost...

Cheers,
  Chris




RE: Urlfilter Patch

chrismattmann
In reply to this post by Jérôme Charron
Hi Jerome,

> Yes, the fetcher can't rely on the document mime-type.
> The only thing we can use for filtering is the document's URL.
> So, another alternative, could be to exclude only files extensions that
> are
> registered in the mime-type repository
> (some well known file extensions) but for which no parser is activated.
> And
> accepting all other ones.
> So that the .foo files will be fetched...

Yup, the key phrase is "well known". It would sort of be an optimization, or
heuristic, to save some work on the regex...

Cheers,
  Chris

