Problem indexing Files

D.Saravanaraj
Hi, I am using Nutch to index files on the local file system and over FTP.

My filter file is:

-^(http|ftp|mailto):
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|mpg|gz|rpm|tgz|mov|MOV|exe|png|PNG|jar)$
-[?*!@=]
-.*(/.+?)/.*?\1/.*?\1/
+^file:/E:/Index Samples/
-^file:/E:/Index Samples/Index/

But Nutch also crawls the forbidden folders. Is there something like the
WebDB for files as well? Is it possible to make Nutch index files based on
their last-modified date?

Can anybody suggest a data structure for a WebDB (FileDB?) for files? It
would be good to group files and create separate segments for each group, so
that if some files change, only those segments need to be replaced.

Rgds,
D.Saravanaraj

Re: Problem indexing Files

Gal Nitzan
Make sure you add a line containing only -. at the end of your regex filter
file to disallow anything else.
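
To see why the forbidden folder still gets crawled, here is a minimal Python sketch of how the filter file is evaluated. It assumes the behaviour described in the comments of the stock regex-urlfilter.txt: the first matching pattern decides, and a URL that matches nothing is ignored. This is a simulation, not Nutch code; parse_rules, accepts and the two sample URLs are made up for illustration, and Python's re.search only approximates Java's regex matching. Under that assumption the + rule for "Index Samples/" matches before the - rule for "Index Samples/Index/" is ever reached, so the more specific deny rule has to come first.

```python
import re


def parse_rules(text):
    """Turn filter-file lines into (allow, compiled_regex) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        # The first character is the sign, the rest is the regex.
        rules.append((line[0] == "+", re.compile(line[1:])))
    return rules


def accepts(rules, url):
    """Assumed semantics: first matching rule decides, no match rejects."""
    for allow, pattern in rules:
        if pattern.search(url):
            return allow
    return False


# The filter file as posted.
ORIGINAL = r"""
-^(http|ftp|mailto):
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|mpg|gz|rpm|tgz|mov|MOV|exe|png|PNG|jar)$
-[?*!@=]
-.*(/.+?)/.*?\1/.*?\1/
+^file:/E:/Index Samples/
-^file:/E:/Index Samples/Index/
"""

# Same rules, with the deny rule moved above the allow rule
# and an explicit catch-all "-." at the end.
REORDERED = r"""
-^(http|ftp|mailto):
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|mpg|gz|rpm|tgz|mov|MOV|exe|png|PNG|jar)$
-[?*!@=]
-.*(/.+?)/.*?\1/.*?\1/
-^file:/E:/Index Samples/Index/
+^file:/E:/Index Samples/
-.
"""

# Hypothetical URLs, only for illustration.
forbidden = "file:/E:/Index Samples/Index/a.txt"
wanted = "file:/E:/Index Samples/docs/a.txt"

print(accepts(parse_rules(ORIGINAL), forbidden))   # True: the + rule wins
print(accepts(parse_rules(REORDERED), forbidden))  # False: the - rule wins
print(accepts(parse_rules(REORDERED), wanted))     # True
```

If that assumption holds for your Nutch version, the final -. only makes the default rejection explicit; it is the order of the two file: rules that keeps the Index folder out.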

On Mon, 2006-02-06 at 09:03 +0530, Saravanaraj Duraisamy wrote:

> Hi, I am using Nutch to index files on the local file system and over FTP.
>
> My filter file is:
>
> -^(http|ftp|mailto):
> -\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|mpg|gz|rpm|tgz|mov|MOV|exe|png|PNG|jar)$
> -[?*!@=]
> -.*(/.+?)/.*?\1/.*?\1/
> +^file:/E:/Index Samples/
> -^file:/E:/Index Samples/Index/
>
> But Nutch also crawls the forbidden folders. Is there something like the
> WebDB for files as well? Is it possible to make Nutch index files based on
> their last-modified date?
>
> Can anybody suggest a data structure for a WebDB (FileDB?) for files? It
> would be good to group files and create separate segments for each group, so
> that if some files change, only those segments need to be replaced.
>
> Rgds,
> D.Saravanaraj