Output of index

Output of index

Malcolm Clark
Hi,
I'm going to attempt to output several thousand documents from a 3+ million document collection into a CSV file.
What is the most efficient way to retrieve all the text from the fields of each document, one by one? Please help!
 
Thanks,
Malcolm

Distributed Search

Mark Miller-3
I know there has been a lot of discussion on distributed search... I am
looking for a cross-platform solution, which seems to kill Solr's
approach... Everyone seems to have implemented this, but only as
proprietary code... It would seem that just using the RMI searcher would
allow a simple solution? Is that the case? Could you easily provide
clustering and failover using a variety of indexes and searching them
all with the RMI searcher? Is it really all that complicated? I have read
that Lucene tops out at about 10m docs on a single server... I want to
hit 100m. I have a beautiful app that allows real-time updating/searching
(updates are rare but should be instant), and I just want it to scale
up to 100m docs or so. Is that going to be a really advanced project
no matter how I slice it? I have done a lot of custom work with the
Lucene stuff, so it would seem difficult to adapt it to Nutch (but what
do I know about Nutch)... I have seen a lot of talk but not much on a
simple RMI searcher solution... Any ideas?

- Mark

Re: Output of index

Otis Gospodnetic-2
In reply to this post by Malcolm Clark
I think:
- Get the document count from IndexReader (use maxDoc(), since doc ids can have gaps).
- Loop from 0 to that number.
- If reader.isDeleted(docId) == false:
    get the doc
    output the contents of the doc's fields
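
In code, that loop comes out to something like this (an untested sketch
against the Lucene 2.0 API; the field names and paths are placeholders
for whatever your index actually holds):

import java.io.FileWriter;
import java.io.PrintWriter;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;

public class DumpToCsv {
    // Placeholder field names -- substitute the stored fields you want.
    private static final String[] FIELDS = { "id", "title", "body" };

    public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open("/path/to/index");
        PrintWriter out = new PrintWriter(new FileWriter("docs.csv"));
        for (int i = 0; i < reader.maxDoc(); i++) {
            if (reader.isDeleted(i)) continue;   // skip deleted slots
            Document doc = reader.document(i);   // loads stored fields only
            StringBuffer row = new StringBuffer();
            for (int f = 0; f < FIELDS.length; f++) {
                if (f > 0) row.append(',');
                String v = doc.get(FIELDS[f]);   // null if absent or unstored
                if (v == null) v = "";
                // quote and escape so embedded commas/quotes/newlines
                // don't break the CSV
                row.append('"').append(v.replaceAll("\"", "\"\"")).append('"');
            }
            out.println(row.toString());
        }
        out.close();
        reader.close();
    }
}

One caveat: reader.document(i) only hands back *stored* field values, so
text that was indexed but not stored can't be recovered this way.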

Otis

Re: Distributed Search

Otis Gospodnetic-2
In reply to this post by Mark Miller-3
I think we have an RMI example in Lucene in Action.
You could also look at how Nutch does it.  I think the code is in the org.apache.nutch.ipc package.
I'm not sure why the cross-platform requirement rules out Solr; I would think it would be exactly the opposite.
As for the 10m limit, it depends.  It depends on the actual size of the index (indexed fields), the complexity of queries, the required query latency, the hardware you throw at it, etc.  So you can't really say 10m is the limit.  You might have gotten that number from some of the older Nutch docs/presentations, which are a few years old now and Nutch-specific.

Clustering and failover and "easily" don't really go together, in my experience, and this is not limited to Luceneland. :(
I'd love to be wrong about this, but it seems clustering/failover/HA work + Lucene always ends up being a custom and proprietary job.
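
For reference, the RMI wiring itself is small (an untested sketch against
Lucene 2.0's RemoteSearchable and MultiSearcher; the host names, port, and
index path are placeholders):

import java.rmi.Naming;
import java.rmi.registry.LocateRegistry;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MultiSearcher;
import org.apache.lucene.search.RemoteSearchable;
import org.apache.lucene.search.Searchable;

// On each shard host: export a local searcher over RMI.
public class ShardServer {
    public static void main(String[] args) throws Exception {
        LocateRegistry.createRegistry(1099);
        Searchable local = new IndexSearcher("/data/index");
        Naming.rebind("//localhost:1099/searcher", new RemoteSearchable(local));
        // The exported object keeps the JVM alive, waiting for calls.
    }
}

// On the search front end: look up each shard and fan the query out.
class SearchClient {
    public static void main(String[] args) throws Exception {
        Searchable[] shards = {
            (Searchable) Naming.lookup("//shard1:1099/searcher"),
            (Searchable) Naming.lookup("//shard2:1099/searcher")
        };
        MultiSearcher searcher = new MultiSearcher(shards);
        Hits hits = searcher.search(
            new QueryParser("body", new StandardAnalyzer()).parse("lucene"));
        System.out.println(hits.length() + " total hits across shards");
    }
}

Swap in ParallelMultiSearcher to hit the shards concurrently. The wiring
really is that small; the hard parts are the ones above -- what happens
when a shard drops, and how the shard indexes get distributed and kept
consistent.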

Otis

Re: Distributed Search

Mark Miller-3
Thanks for the info, Otis. I thought I read that Solr requires an OS
that supports hard links, and that Windows only supports soft links.
Perhaps I am wrong.

Thanks,

- mark

Re: Distributed Search

Yonik Seeley-2
On 7/27/06, Mark Miller <[hidden email]> wrote:
> I thought I read that solr requires an OS that
> supports hard links and thought that Windows only supports soft links.

For the default index distribution method from master to searcher,
yes, hard links are currently needed.

The distribution mechanism is *very* loosely coupled with Solr, though,
and one could come up with an alternate method. Also, Cygwin might
support hard links to files now (I tried it quickly and it seems to
work), so that might be a path forward on Windows.
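
For the curious: the scheme leans on the fact that Lucene index files are
write-once, so a snapshot is just a directory of hard links -- a second
name for the same bytes, created instantly and costing no extra disk until
the live index merges those segments away. The JDK has no hard-link call,
so you shell out; very roughly (an untested sketch, paths are
placeholders):

import java.io.IOException;

public class Snapshot {
    public static void main(String[] args) throws Exception {
        // "cp -lr" recreates the directory tree but hard-links each file
        // instead of copying its bytes -- roughly what the snapshooter
        // script does on the master.
        Process p = Runtime.getRuntime().exec(
            new String[] { "cp", "-lr", "/data/index", "/data/snapshot-20060727" });
        if (p.waitFor() != 0) {
            throw new IOException("snapshot failed");
        }
    }
}

rsync then ships the snapshot to the searchers, which is why an OS
without hard links breaks the default setup.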

-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server

Re: Distributed Search

jrodenburg
In reply to this post by Mark Miller-3
Hi Mark -

Having gone down this path for the past year, I echo the comments from
others that scalability/availability/failover is a lot of work. We
migrated from a custom system based on Lucene running on Windows to Solr
running on Linux. It took us 6 months to get our system to a solid
five-nines availability. Having done this before, I'd advise you not to
underestimate the effort involved. We would have taken the simple route
had it been available.

We shifted to Solr because of the operational elements that allow us to
achieve clustering and failover within the Linux/Apache/Tomcat mix (our
flavor). It just works better for us than our home-brew.

-- j
