RegexURLFilter / testing regex-urlfilter.txt


Thomas Delnoij-3
All.

I want to run the RegexURLFilter's main() method for testing the
regex-urlfilter.txt.

I set up NUTCH_HOME and NUTCH_CONF_DIR so I think I set up my environment
correctly.

When I run nutch org.apache.nutch.net.RegexURLFilter I get Exception in
thread "main" java.lang.NoClassDefFoundError:
org/apache/nutch/net/RegexURLFilter.

Assuming this was a classpath issue, I added
NUTCH_HOME/plugins/urlfilter-regex/urlfilter-regex.jar to my classpath.

This did not solve the problem, as I am still getting the
NoClassDefFoundError.

So my first question is how to set up my environment correctly for testing
the regex-urlfilter.

Secondly, I want to tune my regex-urlfilter for maximum relevance of the
crawl results. I now have around 50 entries. My second question is whether I
should expect any performance impact from that many entries.

Your help is greatly appreciated.

Kind regards, Thomas Delnoij.

Re: RegexURLFilter / testing regex-urlfilter.txt

Thomas Delnoij-3
I was a bit in a hurry when I posted this message, apologies.

The problem is actually a bit different.

I added both urlfilter-regex.jar and nutch-0.7.1.jar to my classpath.

When I run java org.apache.nutch.net.RegexURLFilter,


Re: RegexURLFilter / testing regex-urlfilter.txt

Thomas Delnoij-3
All.

The problem is actually a bit different. I was in a bit of a hurry when I
posted the previous message, apologies.

I added both urlfilter-regex.jar and nutch-0.7.1.jar to my classpath.

When I run java org.apache.nutch.net.RegexURLFilter, I am getting

051005 221040 parsing jar:file:/C:/Personal/vvdb/Nutch/nutch-0.7.1/nutch-0.7.1.jar!/nutch-default.xml
051005 221040 parsing jar:file:/C:/Personal/vvdb/Nutch/nutch-0.7.1/nutch-0.7.1.jar!/nutch-site.xml
051005 221040 Plugins: directory not found: plugins
Exception in thread "main" java.lang.ExceptionInInitializerError
Caused by: java.lang.NullPointerException
        at org.apache.nutch.net.RegexURLFilter.<clinit>(RegexURLFilter.java:64)

When I run nutch org.apache.nutch.net.RegexURLFilter, I am getting

Exception in thread "main" java.lang.NoClassDefFoundError:
org/apache/nutch/net/RegexURLFilter

I know I am missing something obvious, but your help is really appreciated.

Kind regards, Thomas Delnoij



Re: RegexURLFilter / testing regex-urlfilter.txt

Thomas Delnoij-3
For the sake of the archives, I will answer my own question here: I had to
add the following line to the bin/nutch script to be able to run
org.apache.nutch.net.RegexURLFilter from the command line:

CLASSPATH=${CLASSPATH}:$NUTCH_HOME/plugins/urlfilter-regex/urlfilter-regex.jar

The nutch script overrides the CLASSPATH environment variable, so adding the
jar to my environment's classpath didn't help.
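
A minimal sketch of why the exported classpath is lost, assuming the script starts its classpath with the assignment quoted later in this thread. The NUTCH_HOME value and the user-supplied jar path are placeholders for illustration, not real locations:

```shell
#!/bin/sh
# Sketch only: NUTCH_HOME and the user jar are hypothetical placeholders.
NUTCH_HOME=/tmp/nutch-demo
unset NUTCH_CONF_DIR

# Value the user set in the environment before calling the script.
CLASSPATH=/some/user/supplied.jar

# The script begins its classpath from the conf directory, which
# discards whatever CLASSPATH held on entry.
CLASSPATH=${NUTCH_CONF_DIR:=$NUTCH_HOME/conf}

# The fix described above: append the plugin jar inside the script.
CLASSPATH=${CLASSPATH}:$NUTCH_HOME/plugins/urlfilter-regex/urlfilter-regex.jar

echo "$CLASSPATH"
```

The printed value no longer contains the user-supplied jar, which matches the behaviour described above: anything added outside the script is overwritten.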

Rgrds, Thomas Delnoij



Re: RegexURLFilter / testing regex-urlfilter.txt

Bryan Woliner
Sorry if the answer to this question should be obvious, but where in
the bin/nutch script do you need to add the following line to be able
to test your regex-urlfilter.txt file from the command line?

CLASSPATH=${CLASSPATH}:$NUTCH_HOME/plugins/urlfilter-regex/urlfilter-regex.jar




Re: RegexURLFilter / testing regex-urlfilter.txt

Thomas Delnoij-3
I added this statement right after

# CLASSPATH initially contains $NUTCH_CONF_DIR, or defaults to $NUTCH_HOME/conf
CLASSPATH=${NUTCH_CONF_DIR:=$NUTCH_HOME/conf}

Then, if you run nutch org.apache.nutch.net.RegexURLFilter, you should be able
to test your URLs.

I am still not completely satisfied with this answer though. The nutch
script contains the following statement:

# add plugins to classpath
if [ -d "$NUTCH_HOME/plugins" ]; then
  CLASSPATH=${CLASSPATH}:$NUTCH_HOME
fi


What is this statement supposed to accomplish? Shouldn't this read something
like

# add plugins to classpath
if [ -d "$NUTCH_HOME/plugins" ]; then
  CLASSPATH=${CLASSPATH}:$NUTCH_HOME/plugins/**/*.jar
fi

As it stands, the plugin jars are not on the classpath, not even after running
ant compile or ant jar, which copy them to $NUTCH_HOME/build/plugins.
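
A classpath entry such as plugins/**/*.jar would not be expanded inside a variable assignment, so one way to get the effect asked about here is to loop over the jars. This is only a sketch of that idea, not what the stock script does; the add_plugin_jars function name and the temporary directory layout are made up for the demonstration:

```shell
#!/bin/sh
# Sketch only: add_plugin_jars is a hypothetical helper, not part of bin/nutch.
add_plugin_jars() {
  plugins_dir=$1
  for jar in "$plugins_dir"/*/*.jar; do
    # Skip the literal pattern when no jar matches.
    [ -f "$jar" ] || continue
    CLASSPATH=${CLASSPATH:+$CLASSPATH:}$jar
  done
}

# Demo against a throwaway tree standing in for $NUTCH_HOME/plugins.
demo=$(mktemp -d)
mkdir -p "$demo/plugins/urlfilter-regex" "$demo/plugins/parse-html"
: > "$demo/plugins/urlfilter-regex/urlfilter-regex.jar"
: > "$demo/plugins/parse-html/parse-html.jar"

CLASSPATH=$demo/conf
add_plugin_jars "$demo/plugins"
echo "$CLASSPATH"

# Remove the throwaway tree again.
rm -rf "$demo"
```

In the real script the call would be something like add_plugin_jars "$NUTCH_HOME/plugins", guarded by the existing directory test.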

Rgrds, Thomas




Re: RegexURLFilter / testing regex-urlfilter.txt

Doug Cutting-2
In reply to this post by Thomas Delnoij-3
Thomas Delnoij wrote:
> I want to run the RegexURLFilter's main() method for testing the
> regex-urlfilter.txt.

Try instead:

bin/nutch org.apache.nutch.net.URLFilterChecker

Doug