File system watching for intranets

File system watching for intranets

Ben Ogle
Hi all, our organization is using Nutch on a documentation intranet that changes every now and then. To keep the index up to date, we are recrawling the whole thing every night. For an intranet this seems like a workaround at best. Our Nutch crawler is on the same server as our content, and a simpler solution, IMO, would be to monitor file system events and recrawl only the pages that need it each time something changes. That way our index would always be up to date and there would be no reason to do a brute-force recrawl every night. I am willing to write this functionality and contribute it to the community, as I believe other organizations could benefit from it as well, but since I am not as familiar with Nutch as some of the folks here, I have a few questions.

- Is this a solution to a nonexistent problem? I mean, is there a nice solution using the tools already provided? I know each page is timestamped in the database when it is fetched, but does that timestamp correspond to the last-modified date?

- Could this be done using the existing generate/fetch/update cycle plus an index update? Is there a way to fetch and index only the necessary pages? I suppose my tool could generate the fetch list(s) (I need to look into this more closely).

- Are there any other libraries like JNotify that could implement this functionality? I haven't found any others.
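
For concreteness, here is roughly what the watcher side might look like with JNotify (just a sketch; enqueueForRecrawl() is a made-up placeholder for whatever turns a changed path into a fetch list entry):

    import net.contentobjects.jnotify.JNotify;
    import net.contentobjects.jnotify.JNotifyListener;

    // Minimal watcher sketch: watch a docroot recursively and funnel every
    // change into a hypothetical recrawl queue.
    public class DocWatcher {
        public static void main(String[] args) throws Exception {
            int mask = JNotify.FILE_CREATED | JNotify.FILE_MODIFIED
                     | JNotify.FILE_DELETED | JNotify.FILE_RENAMED;
            JNotify.addWatch("/var/www/docs", mask, true /* watch subtree */,
                new JNotifyListener() {
                    public void fileCreated(int wd, String root, String name) {
                        enqueueForRecrawl(root + "/" + name);
                    }
                    public void fileModified(int wd, String root, String name) {
                        enqueueForRecrawl(root + "/" + name);
                    }
                    public void fileDeleted(int wd, String root, String name) {
                        enqueueForRecrawl(root + "/" + name);
                    }
                    public void fileRenamed(int wd, String root, String oldName, String newName) {
                        enqueueForRecrawl(root + "/" + newName);
                    }
                });
            Thread.sleep(Long.MAX_VALUE); // JNotify calls back on its own thread
        }

        static void enqueueForRecrawl(String path) {
            System.out.println("changed: " + path); // placeholder
        }
    }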

Any input/suggestions/additional questions/whatever on this subject is appreciated, as I would like to come up with a better solution for those of us running Nutch on intranets.

Ben

Re: File system watching for intranets

Michael Wechner
Ben Ogle wrote:

>Hi all, our organization is using nutch on a documentation intranet that
>changes every now and then. To keep the index up to date, we are recrawling
>the whole thing every night. For an intranet this seems to be a workaround
>at best. Our nutch crawler is on the same server as our content and a
>simpler solution, IMO, would be to monitor file system events and just
>recrawl the necessary pages each time something changes. That way our index
>would always be up to date and there would be no reason to do a brute force
>recrawl every night. I am willing to write this functionality and contribute
>it to the community as I believe other organizations could benefit from this
>as well, but since I am not as familiar with nutch as some of the folks
>here, I have a few questions.
>
>- Is this a solution to a nonexistent problem?
>

I don't think there is any standardized way to do this yet, so any step in this direction would be a great improvement.

> I mean, is there a nice
>solution using the tools already provided?
>

Not that I am aware of, but I guess other people have tackled this as well.

I think it would be nice to generate an RSS feed or something similar as the fetchlist, which could also be accessed by other crawlers.
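
Something very simple might already do, e.g. (just a sketch; the channel details are made up):

    import java.io.PrintWriter;
    import java.util.List;

    // Sketch: write the changed URLs as a minimal RSS 2.0 feed which Nutch
    // or any other crawler could poll as a fetchlist.
    public class FetchlistFeed {
        public static void write(List<String> changedUrls, PrintWriter out) {
            out.println("<?xml version=\"1.0\"?>");
            out.println("<rss version=\"2.0\"><channel>");
            out.println("<title>changed pages</title>");
            for (String url : changedUrls) {
                out.println("<item><link>" + url + "</link></item>");
            }
            out.println("</channel></rss>");
        }
    }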

> I know each page is time stamped
>in the database when it is fetched, but does this correspond to the last
>modified date?
>  
>

I am still not sure whether Nutch actually compares last-modified dates. I know there is something called "adddays", but that is more for postponing a re-crawl by e.g. 30 days.

>- Could this be done by using the existing generate/fetch/update cycle with
>an index update? Is there a way to just fetch and index the pages necessary?
>I suppose my tool could generate the fetch list(s) (I need to look into this
>more closely).
>
>- Are there any other libraries like JNotify to implement this functionality
>that anyone knows about? I haven't found any others.
>  
>

Does JNotify also implement protocols, e.g. HTTP, in order to notify across networks, or does it only work locally?

Thanks

Michi

>Any input/suggestions/additional questions/whatever on this subject is
>appreciated as I would like to come up with a more optimal solution for us
>intranet nutch users.
>
>Ben
>  
>


--
Michael Wechner
Wyona      -   Open Source Content Management   -    Apache Lenya
http://www.wyona.com                      http://lenya.apache.org
[hidden email]                        [hidden email]
+41 44 272 91 61


Re: File system watching for intranets

Ben Ogle
JNotify is only local. A simple mapping of paths to HTTP locations could be provided in some config file to get around that. Also, I figure that in an intranet situation, the admin setting up Nutch owns all of the other servers that will need to be fetched from, so (s)he could install Nutch on all those machines to run this tool.
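
For the mapping, a simple properties file might be enough. A sketch (the file format and names are made up):

    import java.io.FileInputStream;
    import java.io.IOException;
    import java.util.Properties;

    // Sketch of the path-to-URL mapping. mapping.properties would hold lines
    // like:  /var/www/docs = http://docs.ourintranet
    public class PathToUrl {
        private final Properties mapping = new Properties();

        public PathToUrl(String configFile) throws IOException {
            mapping.load(new FileInputStream(configFile));
        }

        // Translate a local file path into the URL it is served at, or
        // return null if the path is not under any mapped root.
        public String toUrl(String localPath) {
            for (String root : mapping.stringPropertyNames()) {
                if (localPath.startsWith(root)) {
                    return mapping.getProperty(root) + localPath.substring(root.length());
                }
            }
            return null;
        }
    }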

So the tool could be set up in a distributed intranet situation:

- admin sets up Nutch, similar to this: http://wiki.apache.org/nutch/NutchHadoopTutorial
- admin crawls, then starts this file-watcher tool on each machine that has searchable content

If I use a simple solution, such as generating a fetch list when a file changes (or some amount of time after it changes, to catch related changes), then fetching and updating the db, my thought is that the tool would work as follows (a rough driver sketch follows the steps):

- file changes on a slave node
  - slave node notifies the tool
  - tool starts a map/reduce job to generate the fetch list, fetch, update, etc.
  - the name node (master node?) is notified of the change to the file system and the index is updated
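
For the generate/fetch/update part, I was picturing something like this rough driver (it shells out to the standard Nutch command-line tools; all paths are made up, and whether inject + generate can be made to select exactly the changed pages is the part I still need to verify):

    import java.io.File;
    import java.io.IOException;
    import java.util.Arrays;

    // Rough driver sketch: after the watcher writes the changed URLs into a
    // seed directory, run one generate/fetch/update round via bin/nutch.
    public class RecrawlDriver {
        private static final String NUTCH = "/opt/nutch/bin/nutch";
        private static final String CRAWLDB = "crawl/crawldb";
        private static final String SEGMENTS = "crawl/segments";

        public static void recrawl(String seedDir) throws IOException, InterruptedException {
            run(NUTCH, "inject", CRAWLDB, seedDir);    // make sure the changed URLs are in the db
            run(NUTCH, "generate", CRAWLDB, SEGMENTS); // generate a fetch list in a new segment
            String segment = newestSegment();
            run(NUTCH, "fetch", segment);              // fetch the pages
            run(NUTCH, "updatedb", CRAWLDB, segment);  // fold the results back into the db
        }

        private static String newestSegment() {
            String[] names = new File(SEGMENTS).list();
            Arrays.sort(names); // segment directory names are timestamps
            return SEGMENTS + "/" + names[names.length - 1];
        }

        private static void run(String... cmd) throws IOException, InterruptedException {
            Process p = new ProcessBuilder(cmd).inheritIO().start();
            if (p.waitFor() != 0) {
                throw new IOException("command failed: " + String.join(" ", cmd));
            }
        }
    }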
 
I don't really know how well that would work, though. Can slave nodes start map/reduce jobs? Should they? Would the task be distributed among the other nodes? Ideally, I suppose, the slave node should react in the following manner:

- file changes on a slave node
  - slave node notifies the tool
  - tool notifies the master node of the update
  - master node starts a map/reduce job to do the update
    - this would properly distribute the task of doing the update, right?
   
With this scenario, I am not sure how (or whether it is possible) to notify the master node from within Hadoop itself.
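
Worst case, the notification could just be a plain socket message from the slave's watcher to a small listener on the master (entirely hypothetical; the port and message format are made up):

    import java.io.PrintWriter;
    import java.net.Socket;

    // Hypothetical fallback: tell the master about a changed URL over a
    // plain TCP connection, one URL per line.
    public class MasterNotifier {
        public static void notifyMaster(String masterHost, String changedUrl) throws Exception {
            try (Socket s = new Socket(masterHost, 9871);
                 PrintWriter out = new PrintWriter(s.getOutputStream(), true)) {
                out.println(changedUrl);
            }
        }
    }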

So maybe it doesn't scale well, but for an intranet like ours, with one machine doing it all (which is probably similar to a good majority of intranets), it would provide a nice solution.

I hope there is more commentary on this topic, especially regarding distributed environments, as I would like to come up with something that works across a good range of intranet configurations.

Ben
