different regex-urlfilter.txt files for different sets of URLs?

S L
Hi,

I need to have different regex-urlfilter.txt files for different crawls.
Since the file lives in conf and I don't see a way to point nutch inject to
a different file or a different conf directory, I assume I should just swap
in a different regex-urlfilter.txt file every time I do a crawl.

Does that sound right?

Thanks.

Sol

Re: different regex-urlfilter.txt files for different sets of URLs?

Sebastian Nagel
Hi Sol,

Of course, you could set up a separate Nutch package for every crawl.

In local mode, it's easier to point NUTCH_CONF_DIR to the right directory.
It can even be a list of directories separated by ':' that are searched in
order for config files (config files are actually looked up on the Java
classpath). For example, one could define a shell function for Nutch:
 nutch () {
    NUTCH_LOG_DIR=./logs NUTCH_CONF_DIR=./conf:$NUTCH_HOME/conf $NUTCH_HOME/bin/nutch "$@"
 }

Any config file found in ./conf/ (usually nutch-site.xml) takes precedence
over the file of the same name in $NUTCH_HOME/conf/.
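
To check which copy of a config file would win for a given search path, a small helper along these lines may be useful. This is only a sketch: find_conf is a hypothetical function, not part of Nutch, and it merely imitates the "first directory wins" lookup on plain directories rather than querying the real Java classpath.

```shell
# Sketch only: find_conf is a hypothetical helper, not part of Nutch.
# It prints the first match for a file name along a ':'-separated list
# of directories, so an earlier directory shadows a later one.
find_conf() (
    name=$1
    IFS=:      # split the directory list on ':' (subshell body, IFS is not leaked)
    set -f     # no glob expansion while splitting
    for dir in $2; do
        if [ -f "$dir/$name" ]; then
            printf '%s\n' "$dir/$name"
            exit 0
        fi
    done
    exit 1     # not found in any directory
)
```

For example, find_conf regex-urlfilter.txt "./conf:$NUTCH_HOME/conf" prints the path in ./conf if the file exists there, and otherwise the one in $NUTCH_HOME/conf.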

For your specific use case, see also:

<property>
  <name>urlfilter.regex.file</name>
  <value>regex-urlfilter.txt</value>
  <description>Name of file on CLASSPATH containing regular expressions
  used by urlfilter-regex (RegexURLFilter) plugin.</description>
</property>

This would also work in cluster mode, as you can set or override properties
on the command line when launching Nutch.
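
As a rough sketch of how that could look for two crawls: the file names, the example.com/example.org hosts, the CONF_DIR and NUTCH_BIN variables, and the inject_with_filter wrapper below are all assumptions made for illustration. Only the urlfilter.regex.file property and the -D override come from the description above. The filter files are assumed to sit in a directory that ends up on Nutch's classpath.

```shell
# Assumptions: file names and hosts are hypothetical placeholders, and
# CONF_DIR is a directory that ends up on Nutch's classpath.
CONF_DIR=${CONF_DIR:-./conf}
mkdir -p "$CONF_DIR"

# One filter file per crawl: '+' accepts, '-' rejects, first matching rule wins.
cat > "$CONF_DIR/regex-urlfilter-news.txt" <<'EOF'
+^https?://news\.example\.com/
-.
EOF

cat > "$CONF_DIR/regex-urlfilter-docs.txt" <<'EOF'
+^https?://docs\.example\.org/
-.
EOF

# Hypothetical wrapper: run "nutch inject" with the filter file chosen per call.
inject_with_filter() {
    # $1 = filter file name, looked up on the classpath
    # $2 = crawldb path, $3 = directory of seed URL files
    "${NUTCH_BIN:-$NUTCH_HOME/bin/nutch}" inject \
        -Durlfilter.regex.file="$1" "$2" "$3"
}
```

A call would then look like inject_with_filter regex-urlfilter-news.txt crawl-news/crawldb urls-news. Other steps that apply URL filters would presumably need the same -D setting.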

Sebastian

On 11/08/2017 03:55 PM, Sol Lederman wrote:

> Hi,
>
> I need to have different regex-urlfilter.txt files for different crawls.
> Since the file lives in conf and I don't see a way to point nutch inject to
> a different file or a different conf directory, I assume I should just swap
> in a different regex-urlfilter.txt file every time I do a crawl.
>
> Does that sound right?
>
> Thanks.
>
> Sol
>

Re: different regex-urlfilter.txt files for different sets of URLs?

S L
Awesome! Thank you.

Re: different regex-urlfilter.txt files for different sets of URLs?

Rushikesh K
Hi Sol,
I have a question. We are trying to use Nutch 1.3 to crawl our website,
and we need to skip the header and footer of each page. I searched
online but couldn't find an exact solution. Could you please guide us
through that?

I really appreciate your help in advance!

On Thu, Nov 9, 2017 at 11:23 AM, Sol Lederman <[hidden email]>
wrote:

> Awesome! Thank you.
>



--
Regards
Rushikesh M
.Net Developer

Re: different regex-urlfilter.txt files for different sets of URLs?

S L
Hi Rushikesh,

I'm very new to Nutch. I'll let Sebastian and the other experts guide you.
I suspect that success in removing the header and footer will be very
dependent on the HTML files you're processing.

A quick Google search finds these pages:

http://grokbase.com/t/nutch/user/155ensey7k/parsing-pages-but-removing-headers-and-footers
http://grokbase.com/t/nutch/user/1563bdhv85/crawling-pages-but-ignoring-header-and-footer
http://lucene.472066.n3.nabble.com/Removing-Common-Web-Page-Header-and-Footer-from-content-td4168764.html


I suggest you start a new thread since I don't believe your question has
anything to do with this regex-urlfilter.txt discussion.

I also suggest that you try to implement what is suggested in those pages
and then write back (in a new discussion thread) what you did and what
isn't working.

Sol

On Thu, Nov 9, 2017 at 11:02 AM, Rushikesh K <[hidden email]>
wrote:

> Hi Sol,
> I have a question. We are trying to use Nutch 1.3 to crawl our website,
> and we need to skip the header and footer of each page. I searched
> online but couldn't find an exact solution. Could you please guide us
> through that?
>
> I really appreciate your help in advance!