adding new URLs to nutch index

adding new URLs to nutch index

Dima Gritsenko
Hi,

We are indexing DMOZ, and we want to add two other URLs for indexing, but we seem to have a problem searching for those two newly added URLs (no results are returned).
Here's what we do to add the new URLs to the Nutch index:
1) Create a dir /url with a "url" file containing these two URLs:
    http://www.newsvine.com/_feeds/rss2/index
    http://www.technorati.com/blogs/

2) Then run the following command (it should add our extra URLs to the Nutch DB/index):
    bin/nutch inject crawl/crawldb urls

3) Then start the recrawl:
    bin/recrawl /home/dima/workspace/hapool/ /usr/share/nutch-0.8/crawl/ 3 0
 
We are also using the index-url-category plugin, which assigns URLs to different categories for future filtered search. Here's what we do:

Patterns added to regex-urlfilter.txt:

# accept anything else
+^http:\/\/www\.technorati\.com\/blogs.*
+.*rss.*

-.

Patterns added to crawl-urlfilter.txt:

# accept hosts in MY.DOMAIN.NAME
+^http:\/\/www\.technorati\.com\/blogs.*
+.*rss.*


# skip everything else
-.
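The two filter files above share the same rule shape, so a quick way to sanity-check the seed URLs against them is to simulate the evaluation. This is only a sketch, under the assumption that Nutch's regex URL filter tries rules top to bottom and the sign of the first matching rule decides acceptance:

```python
import re

# Rules as in regex-urlfilter.txt / crawl-urlfilter.txt above:
# '+' means accept, '-' means reject; first matching rule wins.
RULES = [
    ("+", r"^http://www\.technorati\.com/blogs.*"),
    ("+", r".*rss.*"),
    ("-", r"."),
]

def accepts(url):
    """Return True if the first rule matching the URL is an accept rule."""
    for sign, pattern in RULES:
        if re.search(pattern, url):
            return sign == "+"
    return False  # no rule matched

print(accepts("http://www.technorati.com/blogs/"))           # True
print(accepts("http://www.newsvine.com/_feeds/rss2/index"))  # True
print(accepts("http://example.com/page.html"))               # False (falls through to -.)
```

If a seed URL comes back False here, it would be filtered out and never reach the crawldb in the first place.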


Patterns used in index-url-category plugin

rules.properties file

# News
http://newsrss.bbc.co.uk/rss/*=news
http://www.newsvine.com/*=news
.*rss.*=news
.*\.xml=news

# Blogs
.*technorati\.com\/blogs.*=blogs

# Web
.*=web
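Since the categorization is the part that can misbehave, it may help to simulate the rules.properties lookup too. A sketch under the assumption that rules are evaluated top to bottom with first match winning, and with the glob-style `/*` entries rewritten as regexes (an assumption about how the plugin interprets them):

```python
import re

# rules.properties from above, as (pattern, category) pairs;
# the glob '*' entries are written as regex '.*' here (an assumption).
CATEGORY_RULES = [
    (r"http://newsrss\.bbc\.co\.uk/rss/.*", "news"),
    (r"http://www\.newsvine\.com/.*", "news"),
    (r".*rss.*", "news"),
    (r".*\.xml", "news"),
    (r".*technorati\.com/blogs.*", "blogs"),
    (r".*", "web"),  # catch-all
]

def category(url):
    """Return the category of the first rule that matches the URL."""
    for pattern, cat in CATEGORY_RULES:
        if re.match(pattern, url):
            return cat
    return None

# The newsvine feed URL contains "rss", so it lands in "news" before the
# catch-all; the technorati URL only matches the blogs rule.
print(category("http://www.newsvine.com/_feeds/rss2/index"))  # news
print(category("http://www.technorati.com/blogs/"))           # blogs
print(category("http://example.com/"))                        # web
```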

Thank you.
Dima.


RE: adding new URLs to nutch index

Vishal Shah-3
Hi Dima,

  Which version of Nutch are you using? From 0.8 onwards, the name of
the urls file has to be urls.txt, and its parent dir has to be passed
to inject. For example, if your urls.txt is in a dir called NewUrls, then
your inject cmd would be:

bin/nutch inject crawl/crawldb NewUrls

Also, check your crawl-urlfilter.txt to make sure that these new URLs
won't be filtered.
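The suggested layout can be sketched like this (the NewUrls name is just an example; bin/nutch itself is not invoked here, only the file layout is prepared):

```python
from pathlib import Path

# Nutch 0.8+ seed layout per the advice above: a urls.txt file
# inside a directory, and the directory is what gets passed to inject.
seed_dir = Path("NewUrls")
seed_dir.mkdir(exist_ok=True)
(seed_dir / "urls.txt").write_text(
    "http://www.newsvine.com/_feeds/rss2/index\n"
    "http://www.technorati.com/blogs/\n"
)

# The inject command is then run against the directory, not the file:
print(f"bin/nutch inject crawl/crawldb {seed_dir}")
```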

Regards,

-vishal.



Re: adding new URLs to nutch index

Dima Gritsenko
Thank you, Vishal.
This part is working well now. I'm still figuring out why the URLs have not
been properly categorized, though.

Dima.
