lucene/nutch question...


lucene/nutch question...

aaaaa
Hi.

Got a very basic lucene/nutch question.

Assume I have a page that has a form. Within the form are a number of
select/drop-down boxes, etc. Each element supplies a variable that becomes
part of the query string appended to the form action. Is there a way for
lucene/nutch to build up the action URLs from those query string vars, so
that it can actually crawl each possible combination of URLs?

Also, is nutch/lucene the right app to use in this scenario? Is
there a better app to handle this kind of process?

Thanks

-bruce







Re: lucene/nutch question...

brainstorm-2-2
If I understand correctly, you are looking for a way to test/fill
forms... if that's the case, I recommend the following tools:

http://wtr.rubyforge.org/
http://search.cpan.org/~petdance/WWW-Mechanize-1.34/lib/WWW/Mechanize.pm

But I guess that, with some coding effort, nutch can also achieve what you want.

Regards,
Roman

On Thu, Aug 14, 2008 at 11:51 PM, bruce <[hidden email]> wrote:

> Hi.
>
> Got a very basic lucene/nutch question.
>
> Assume I have a page that has a form. Within the form are a number of
> select/drop-down boxes/etc... In this case, each object would comprise a
> variable which would form part of the query string as defined in the form
> action. Is there a way for lucene/nutch to go through the process of
> building up the actions based on the querystring vars, so that lucene/nutch
> can actually search through each possible combination of urls....
>
> Also, is nutch/lucene the right/correct app to use in this scenario? Is
> there a better app to handle this kind of potential application/process.
>
> Thanks
>
> -bruce

RE: lucene/nutch question...

aaaaa
Hi Roman...

umm, no. assume you have a web page, and the page has a form on it. within the form, there might be multiple elements (lists/select boxes, etc.). each element has a variable name, which in turn is used as part of the form action to create the entire query string.

sort of like:
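<form action="test.php" method="get">
 <select name="foo">
  <option value="1">1</option>
  <option value="2">2</option>
  <option value="3">3</option>
  <option value="4">4</option>
 </select>

 <select name="cat">
  <option value="1">1</option>
  <option value="2">2</option>
  <option value="3">3</option>
 </select>
</form>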

so you'd get the following urls in this pseudo example:
 test.php?foo=1&cat=1
 test.php?foo=1&cat=2
 test.php?foo=1&cat=3
 test.php?foo=2&cat=1
 test.php?foo=2&cat=2
 test.php?foo=2&cat=3
 test.php?foo=3&cat=1
 test.php?foo=3&cat=2
 test.php?foo=3&cat=3
 test.php?foo=4&cat=1
 test.php?foo=4&cat=2
 test.php?foo=4&cat=3

with this, the app can then continue to crawl the pages. so, i'm looking for some sort of crawler that already does this kind of analysis within the page.

i know i can create a python/perl script for a single site/page.. but since i'm looking at 100s of sites...
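
for reference, a per-page script of the kind mentioned above could look roughly like the sketch below. it is only an illustration under stated assumptions: the page has a single GET form, every field is a select with explicit value attributes, and the names SelectFormParser, form_urls, SAMPLE_HTML and the example.com address are made up for this example. it uses only the python standard library and builds one url per combination of option values.

```python
from html.parser import HTMLParser
from itertools import product
from urllib.parse import urlencode, urljoin


class SelectFormParser(HTMLParser):
    """Collect the form action and the option values of each <select>.

    Assumption: one GET form per page; other field types are ignored.
    """

    def __init__(self):
        super().__init__()
        self.action = None    # action attribute of the first <form> seen
        self.fields = {}      # select name -> list of option values, in page order
        self._current = None  # name of the <select> currently being parsed

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and self.action is None:
            self.action = attrs.get("action") or ""
        elif tag == "select" and attrs.get("name"):
            self._current = attrs["name"]
            self.fields.setdefault(self._current, [])
        elif tag == "option" and self._current and attrs.get("value") is not None:
            # Options without a value attribute are skipped in this sketch.
            self.fields[self._current].append(attrs["value"])

    def handle_endtag(self, tag):
        if tag == "select":
            self._current = None


def form_urls(page_url, html):
    """Return one URL per combination of select values (cartesian product)."""
    parser = SelectFormParser()
    parser.feed(html)
    if not parser.fields:
        return []
    # Resolve a relative action such as "test.php" against the page URL.
    action = urljoin(page_url, parser.action or "")
    names = list(parser.fields)
    urls = []
    for combo in product(*(parser.fields[name] for name in names)):
        urls.append(action + "?" + urlencode(list(zip(names, combo))))
    return urls


# Hypothetical page matching the foo/cat example in this message.
SAMPLE_HTML = """
<form action="test.php" method="get">
 <select name="foo">
  <option value="1">1</option><option value="2">2</option>
  <option value="3">3</option><option value="4">4</option>
 </select>
 <select name="cat">
  <option value="1">1</option><option value="2">2</option>
  <option value="3">3</option>
 </select>
</form>
"""

if __name__ == "__main__":
    for url in form_urls("http://example.com/dir/page.html", SAMPLE_HTML):
        print(url)
```

the generated urls could then be handed to a crawler as a seed list. the number of urls is the product of the option counts, so forms with many fields grow quickly and would need some cap.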

this is why i'm asking about nutch/lucene/solr...

thanks


-----Original Message-----
From: brainstorm [mailto:[hidden email]]
Sent: Thursday, August 14, 2008 3:12 PM
To: [hidden email]
Subject: Re: lucene/nutch question...

