Need help with URL regex

Lucas Rockwell
Hi all,

I have looked in the archive and followed the instructions in the
tutorial, and I am still having problems limiting nutch to just my site.

For instance, the tutorial reads:

    2. Edit the file conf/crawl-urlfilter.txt and replace MY.DOMAIN.NAME
       with the name of the domain you wish to crawl. For example, if you
       wished to limit the crawl to the nutch.org domain, the line should
       read:

       +^http://([a-z0-9]*\.)*nutch.org/
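
As a quick sanity check outside Nutch, the tutorial pattern can be tried with Python's `re` module. This is only a sketch: it assumes Python's regex engine treats this simple pattern the same way Nutch's does, and the sample URLs are invented for illustration. The leading `+` is the filter's accept marker, not part of the regex, so it is dropped here.

```python
import re

# Tutorial pattern with the leading "+" (accept marker) removed.
TUTORIAL_PATTERN = re.compile(r"^http://([a-z0-9]*\.)*nutch.org/")


def accepted(url):
    """Return True if the pattern is found at the start of the URL."""
    return TUTORIAL_PATTERN.search(url) is not None


# Invented sample URLs, not taken from the original post.
SAMPLE_URLS = [
    "http://nutch.org/",
    "http://www.nutch.org/docs/",
    "http://lucene.apache.org/",
]

for url in SAMPLE_URLS:
    # Mimic the "+"/"-" prefix style of the filter test output.
    print(("+" if accepted(url) else "-") + url)
```

Under these assumptions the pattern itself accepts both the bare domain and subdomains. The unescaped `.` in `nutch.org` matches any character, so a stricter variant would write `nutch\.org`.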

But when I test the above regex according to a comment in the archives
on April 16 using:

cat file-with-test-urls | nutch net/nutch/net/RegexURLFilter

I get this for the output:

<snip>
+# skip URLs containing certain characters as probable queries, etc.
--[?*!@=]
-
+# limit to org site only
-+^http://([a-z0-9]*\.)*nutch.org/
-
+# do not accept anything else
++.
</snip>

So, according to the filter test, the regex in the tutorial does not
work. When I use Doug's example from another email
(+^http://www.cs.princeton.edu/(people/(grad|fac)\.php)?$), I also get
the "-" sign when I run the test, and the "-[?*!@=]" line gets a "-"
sign as well...
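
One way to read the snipped output is to model the filter in a few lines of Python. The sketch below is not Nutch's code. It assumes that rules are tried in file order, that the first rule whose regex is found anywhere in the input decides the result, that input matching no rule is rejected, and that blank and `#` lines in the rule file are ignored. The function names are made up for illustration.

```python
import re

# The rule file as it appears in the snipped output, one entry per line.
RULE_FILE = [
    "# skip URLs containing certain characters as probable queries, etc.",
    "-[?*!@=]",
    "",
    "# limit to org site only",
    r"+^http://([a-z0-9]*\.)*nutch.org/",
    "",
    "# do not accept anything else",
    "+.",
]


def compile_rules(lines):
    """Turn '+regex' / '-regex' lines into (accept, pattern) pairs."""
    rules = []
    for line in lines:
        if not line or line.startswith("#"):
            continue  # blank lines and comments are not rules
        rules.append((line[0] == "+", re.compile(line[1:])))
    return rules


def filter_line(rules, text):
    """Return '+' or '-': first matching rule wins, no match means reject."""
    for accept, pattern in rules:
        if pattern.search(text):
            return "+" if accept else "-"
    return "-"


RULES = compile_rules(RULE_FILE)

# Feed the rule file itself through the filter, as if it were the URL list.
for line in RULE_FILE:
    print(filter_line(RULES, line) + line)
```

Under these assumptions, feeding the rule file through its own rules reproduces the snipped output line for line: the rule lines contain characters such as `*` and `?`, so the first rule rejects them, and the comment lines are accepted by the final `+.`. This suggests, though the post does not confirm it, that the test input contained the rule file rather than a list of URLs. It also shows that a final `+.` accepts anything not rejected earlier, which is the opposite of what its comment says.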

So, can anyone out there give me the exact syntax so that nutch will
*only* crawl the domain (and subdomain(s)) for the site I want to
crawl?

Many thanks.

-lucas

[Solved - probably] Re: Need help with URL regex

Lucas Rockwell
Hi again,

I think I may have it working now. Not exactly the way I want, but I
put this:

+^http://www.myorg.org/*

and

-.

and now it seems to be picking up only things from my site. But I have
not tried to put the ([a-z0-9]*\.)* back in.
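
For comparison, here is a small Python sketch of what the posted pattern matches next to a subdomain-aware variant. The `STRICTER` pattern and the sample URLs are assumptions made for illustration, not something from the tutorial or the list, and Python's `re` is again standing in for Nutch's regex engine.

```python
import re

# Pattern as posted, minus the leading "+". The trailing "/*" means
# "zero or more slashes", so it does not require a slash after the host.
POSTED = re.compile(r"^http://www.myorg.org/*")

# Hypothetical variant: optional subdomains, escaped dots, mandatory slash.
STRICTER = re.compile(r"^http://([a-z0-9]*\.)*myorg\.org/")

# Invented sample URLs.
SAMPLE_URLS = [
    "http://www.myorg.org/about.html",
    "http://docs.myorg.org/",
    "http://www.myorg.org.example.com/",
]

for url in SAMPLE_URLS:
    posted_hit = POSTED.search(url) is not None
    stricter_hit = STRICTER.search(url) is not None
    print(url, "posted:", posted_hit, "stricter:", stricter_hit)
```

In this sketch the posted pattern misses other subdomains and lets through a foreign host that merely starts with `www.myorg.org`, while the variant covers subdomains and requires the slash right after the domain. Whether that matters depends on which URLs the crawl actually encounters.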

-lucas

On May 8, 2005, at 4:54 PM, Lucas Rockwell wrote:

> Hi all,
>
> I have looked in the archive and have followed the instructions in the
> tutorial and I am still having problems limiting nutch to just my
> site.
>
> For instance, the tutorial reads:
>
> 2.   Edit the file conf/crawl-urlfilter.txt and replace
> MY.DOMAIN.NAME with the name of the domain you wish to crawl. For
> example, if you wished to limit the crawl to the nutch.org domain, the
> line should read:
> +^http://([a-z0-9]*\.)*nutch.org/
>
> But when I test the above regex according to a comment in the archives
> on April 16 using:
>
> cat file-with-test-urls | nutch net/nutch/net/RegexURLFilter
>
> I get this for the output:
>
> <snip>
> +# skip URLs containing certain characters as probable queries, etc.
> --[?*!@=]
> -
> +# limit to org site only
> -+^http://([a-z0-9]*\.)*nutch.org/
> -
> +# do not accept anything else
> ++.
> </snip>
>
> So, according to the filter test, the regex in the tutorial does
> not work. Also, when I use Doug's example from another email
> (+^http://www.cs.princeton.edu/(people/(grad|fac)\.php)?$) I also get
> the "-" sign when I run the test. Also, the "-[?*!@=]" also gets a "-"
> sign...
>
> So, can anyone out there give me the exact syntax so that nutch will
> *only* crawl the domain (and subdomain(s)) for the site I want to
> crawl?
>
> Many thanks.
>
> -lucas
>