Distributing index over N disks

Distributing index over N disks

Otis Gospodnetic-2
Hello,

Would it make sense and be possible to spread different index files over multiple disks (without resorting to putting an index on a RAID)?
For example, what if the index files didn't live in a single index dir, but were organized by their type in a shallow dir tree, like this:

/path/to/index:
   tis/<tis files here>
   fdt/<fdt files here>
   prx/<prx files here>
   ...

Then one could symlink these tis, fdt, prx, etc. dirs to locations that are really on different disks.
Is this doable, and would it help improve performance?  I think it could improve segment merging, index optimization, and searches, because N disk heads could do ~N times more work in parallel.


But the idea seems so simple that it makes me think I'm missing something, otherwise it would have already been done. :)
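
One thing worth checking about the ~N estimate above: it only holds if the bytes are spread evenly across the disks. Here is a rough back-of-the-envelope sketch, assuming identical disks, a full sequential read of every file, and nothing served from cache. The class name and the sizes are made up for illustration, not measurements.

```java
public class SpeedupBound {
    /**
     * Ideal speedup of reading all files concurrently from separate disks,
     * compared with reading the same bytes from a single disk.
     * The slowest (fullest) disk determines the parallel read time.
     */
    static double idealSpeedup(long... bytesPerDisk) {
        long total = 0;
        long max = 0;
        for (long bytes : bytesPerDisk) {
            total += bytes;
            max = Math.max(max, bytes);
        }
        // With no data at all there is nothing to speed up.
        return max == 0 ? 1.0 : (double) total / max;
    }

    public static void main(String[] args) {
        // Hypothetical sizes in MB for three file types, one type per disk.
        System.out.println(idealSpeedup(100, 100, 100)); // balanced: prints 3.0
        System.out.println(idealSpeedup(500, 300, 200)); // skewed:   prints 2.0
    }
}
```

So with three disks the gain is 3x only in the balanced case. If one file type dominates the index size, the disk holding it becomes the bottleneck.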

Otis
--
Sematext is hiring -- http://sematext.com/about/jobs.html?mls
Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

RE: Distributing index over N disks

Uwe Schindler
It has been technically doable since 2.9 with FileSwitchDirectory, which lets
you list the file name extensions that should be routed to a primary
directory; requests for all other files go to a secondary directory. See
http://lucene.apache.org/java/2_9_1/api/core/org/apache/lucene/store/FileSwitchDirectory.html

To have more directories, just use another FileSwitchDirectory as secondary
and so on.
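
To make the nesting concrete, here is a minimal sketch of the routing idea in plain Java. It is a toy model, not Lucene's FileSwitchDirectory: the `ExtensionRouter` class, the `Target` interface, and the `/mnt/diskN` mount points are all hypothetical. Each switch is two-way, so three disks need two switches, with the inner one acting as the secondary of the outer one.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Toy model of extension-based routing; NOT Lucene's FileSwitchDirectory. */
public class ExtensionRouter {
    /** Anything that can decide where a given index file should live. */
    interface Target {
        String locate(String fileName);
    }

    /** Leaf target: a single (hypothetical) mount point. */
    static Target disk(String mountPoint) {
        return fileName -> mountPoint + "/" + fileName;
    }

    /** Two-way switch: listed extensions go to primary, everything else to secondary. */
    static Target fileSwitch(Set<String> primaryExtensions, Target primary, Target secondary) {
        return fileName -> {
            int dot = fileName.lastIndexOf('.');
            String ext = dot < 0 ? "" : fileName.substring(dot + 1);
            Target chosen = primaryExtensions.contains(ext) ? primary : secondary;
            return chosen.locate(fileName);
        };
    }

    private static Set<String> setOf(String... values) {
        return new HashSet<>(Arrays.asList(values));
    }

    /** Three disks from two nested switches: tis -> disk1, prx -> disk2, rest -> disk3. */
    static Target threeDisks() {
        Target inner = fileSwitch(setOf("prx"), disk("/mnt/disk2"), disk("/mnt/disk3"));
        return fileSwitch(setOf("tis"), disk("/mnt/disk1"), inner);
    }

    public static void main(String[] args) {
        Target index = threeDisks();
        for (String name : new String[] {"_0.tis", "_0.prx", "_0.fdt", "segments_2"}) {
            System.out.println(name + " -> " + index.locate(name));
        }
    }
}
```

Files with no listed extension (here `_0.fdt` and `segments_2`) fall through both switches and end up on the last disk, so the innermost secondary acts as the catch-all.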

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: [hidden email]

> -----Original Message-----
> From: Otis Gospodnetic [mailto:[hidden email]]
> Sent: Tuesday, November 24, 2009 11:32 PM
> To: [hidden email]
> Subject: Distributing index over N disks
>
> Hello,
>
> Would it make sense and be possible to spread different index files over
> multiple disks (without resorting to putting an index on a RAID)?

Re: Distributing index over N disks

Michael McCandless-2
In reply to this post by Otis Gospodnetic-2
I think this is a good idea for indexes that can't fit in the I/O cache.
Report back if you get good results :)  I think FileSwitchDirectory opens up
all sorts of interesting possibilities.

Mike

On Tue, Nov 24, 2009 at 5:31 PM, Otis Gospodnetic
<[hidden email]> wrote:

> Hello,
>
> Would it make sense and be possible to spread different index files over multiple disks (without resorting to putting an index on a RAID)?

Re: Distributing index over N disks

Andrzej Białecki-2
In reply to this post by Uwe Schindler
Uwe Schindler wrote:
> It has been technically doable since 2.9 with FileSwitchDirectory, which lets
> you list the file name extensions that should be routed to a primary
> directory; requests for all other files go to a secondary directory. See
> http://lucene.apache.org/java/2_9_1/api/core/org/apache/lucene/store/FileSwitchDirectory.html
>
> To have more directories, just use another FileSwitchDirectory as secondary
> and so on.

You guys are too sophisticated ;) I know some people have been using a
lo-tek solution commonly known as symlinks - i.e. they put prx and frq
files on an SSD and the rest on a regular HDD, and create symlinks to
prx and frq. This works well with static indexes (no updates, no
merges), and doesn't require code modifications in existing apps.

Seriously, though, I agree that FileSwitchDirectory is the way to go.

--
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

