[jira] Created: (LUCENE-532) [PATCH] Indexing on Hadoop distributed file system

[PATCH] Indexing on Hadoop distributed file system
--------------------------------------------------

         Key: LUCENE-532
         URL: http://issues.apache.org/jira/browse/LUCENE-532
     Project: Lucene - Java
        Type: Improvement
  Components: Index  
    Versions: 1.9    
    Reporter: Igor Bolotin
    Priority: Minor
 Attachments: SegmentTermEnum.java, TermInfosWriter.java

In my current project we needed a way to create very large Lucene indexes on the Hadoop distributed file system (DFS). When we tried to index directly on DFS using the Nutch FsDirectory class, we immediately found that indexing fails because the DfsIndexOutput.seek() method throws UnsupportedOperationException. The reason for this behavior is clear: DFS does not support random updates, so seek() can't be supported (at least not easily).

Well, if we can't support random updates, the question is: do we really need them? A search through the Lucene code revealed two places that call IndexOutput.seek(): one is in TermInfosWriter and the other in CompoundFileWriter. As we weren't planning to use CompoundFileWriter, the only place that concerned us was TermInfosWriter.

TermInfosWriter uses IndexOutput.seek() in its close() method to write the total number of terms back into the beginning of the file. It was very simple to change the file format slightly and write the number of terms into the last 8 bytes of the file instead of into the beginning. The only other place that needs to change for this to work is the SegmentTermEnum constructor, which must read this value at position = file length - 8.

With this format hack we were able to use FsDirectory to write an index directly to DFS without any problems. Well, we still don't index directly to DFS for performance reasons, but at least we can build small local indexes and merge them into the main index on DFS without copying the big main index back and forth.
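
A minimal sketch of the idea, assuming Lucene's Directory/IndexInput/IndexOutput API; the helper class and method names below are hypothetical and only illustrate the end-of-file layout, not the actual patch:

import java.io.IOException;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.IndexInput;
import org.apache.lucene.store.IndexOutput;

class TermCountAtEnd {

  // Writer side: append the term count as the final 8 bytes of the .tis
  // file instead of seeking back to the header, so IndexOutput.seek()
  // is never needed.
  static void finish(IndexOutput tis, long termCount) throws IOException {
    tis.writeLong(termCount);
    tis.close();
  }

  // Reader side: seek (on read, which DFS does support) to length - 8
  // and read the count from the tail of the file.
  static long readTermCount(Directory dir, String tisName) throws IOException {
    IndexInput in = dir.openInput(tisName);
    try {
      in.seek(in.length() - 8);
      return in.readLong();
    } finally {
      in.close();
    }
  }
}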



[jira] Updated: (LUCENE-532) [PATCH] Indexing on Hadoop distributed file system

     [ http://issues.apache.org/jira/browse/LUCENE-532?page=all ]

Igor Bolotin updated LUCENE-532:
--------------------------------

    Attachment: TermInfosWriter.java
                SegmentTermEnum.java

Two patch files are attached.

[jira] Updated: (LUCENE-532) [PATCH] Indexing on Hadoop distributed file system

     [ http://issues.apache.org/jira/browse/LUCENE-532?page=all ]

Igor Bolotin updated LUCENE-532:
--------------------------------

    Attachment: TermInfosWriter.patch
                SegmentTermEnum.patch

[jira] Commented: (LUCENE-532) [PATCH] Indexing on Hadoop distributed file system

    [ http://issues.apache.org/jira/browse/LUCENE-532?page=comments#action_12372044 ]

Doug Cutting commented on LUCENE-532:
-------------------------------------

Instead of changing the value to -1 we should not write a size value in the header at all.  We can change the format number and use that to determine where to read the size.  Does that make sense?
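
A rough sketch of how the read side could use the format number (the constants and helper below are assumptions for illustration, not the values that were eventually committed):

import java.io.IOException;
import org.apache.lucene.store.IndexInput;

class TisSizeByFormat {
  // Hypothetical format ids; the real values live in TermInfosWriter.
  static final int FORMAT_SIZE_IN_HEADER = -1;
  static final int FORMAT_SIZE_AT_END = -2;

  // Read the term count according to the format number found in the header.
  static long readSize(IndexInput in) throws IOException {
    int format = in.readInt();               // first 4 bytes of the .tis file
    if (format == FORMAT_SIZE_IN_HEADER) {   // old layout: size follows the format
      return in.readLong();
    }
    long afterHeader = in.getFilePointer();  // remember where the header ended
    in.seek(in.length() - 8);                // new layout: size is the last 8 bytes
    long size = in.readLong();
    in.seek(afterHeader);                    // resume reading after the header
    return size;
  }
}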

Also, please submit patches as a single 'svn diff' from the top of the lucene tree.  

Thanks!


[jira] Updated: (LUCENE-532) [PATCH] Indexing on Hadoop distributed file system

     [ http://issues.apache.org/jira/browse/LUCENE-532?page=all ]

Igor Bolotin updated LUCENE-532:
--------------------------------

    Attachment: indexOnDFS.patch

Attached is a new patch that uses the format number to determine where to read the size, as discussed.
Thanks!




[jira] Commented: (LUCENE-532) [PATCH] Indexing on Hadoop distributed file system

    [ http://issues.apache.org/jira/browse/LUCENE-532?page=comments#action_12424276 ]
           
Otis Gospodnetic commented on LUCENE-532:
-----------------------------------------

This actually looks like a good patch that doesn't break any tests.  I'll commit it in the coming days, as it looks like it should be backwards compatible... except CFS won't be supported unless somebody patches that, too (I tried quickly and soon got unit tests to fail :( ).


[jira] Updated: (LUCENE-532) [PATCH] Indexing on Hadoop distributed file system

     [ http://issues.apache.org/jira/browse/LUCENE-532?page=all ]

Otis Gospodnetic updated LUCENE-532:
------------------------------------

    Attachment:     (was: SegmentTermEnum.java)


[jira] Updated: (LUCENE-532) [PATCH] Indexing on Hadoop distributed file system

     [ http://issues.apache.org/jira/browse/LUCENE-532?page=all ]

Otis Gospodnetic updated LUCENE-532:
------------------------------------

    Attachment:     (was: TermInfosWriter.java)


Re: [jira] Commented: (LUCENE-532) [PATCH] Indexing on Hadoop distributed file system

Chris Hostetter-3

: This actually looks like a good patch that doesn't break any tests.
: I'll commit it in the coming days, as it looks like it should be
: backwards compatible... except CFS won't be supported unless somebody
: patches that, too (I tried quickly and soon got unit tests to fail :( ).

The one thing the unit tests can't verify is that the version checking is working (i.e., that the new code will still read old indexes properly).  The patch looks fairly straightforward to me, so I don't think that will be a problem -- but you may want to try searching an index built with 2.0 after you apply the patch, just to sanity check it.

This will necessitate "Lucene 2.1" being the next version released from the trunk because of the file format change, correct? ... I have no objection to that, I just want to verify my understanding of the file format versioning.
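
A quick sanity check of the kind suggested above (searching a 2.0-built index with the patched code) might look like the following; the index path and field name are placeholders:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;

// Open an index built with Lucene 2.0 (before the patch) using the patched
// code, and run a query whose hit count is known in advance.
public class OldIndexSanityCheck {
  public static void main(String[] args) throws Exception {
    IndexSearcher searcher = new IndexSearcher("/path/to/2.0-built-index");
    Hits hits = searcher.search(new TermQuery(new Term("contents", "lucene")));
    System.out.println("hits for a known term: " + hits.length());
    searcher.close();
  }
}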



-Hoss



[jira] Commented: (LUCENE-532) [PATCH] Indexing on Hadoop distributed file system

    [ http://issues.apache.org/jira/browse/LUCENE-532?page=comments#action_12433852 ]
           
Chris commented on LUCENE-532:
------------------------------

I don't mean to resurrect old issues, but we're having the same problem here indexing to DFS; I've applied the patch and it works for us. I'm wondering if I'm missing something, or if this is being addressed somewhere else in trunk that I haven't found.


[jira] Commented: (LUCENE-532) [PATCH] Indexing on Hadoop distributed file system

    [ http://issues.apache.org/jira/browse/LUCENE-532?page=comments#action_12435583 ]
           
Otis Gospodnetic commented on LUCENE-532:
-----------------------------------------

I'm hesitant to commit without the CFS support.  It looks like more and more people are using CFS indexes.


[jira] Commented: (LUCENE-532) [PATCH] Indexing on Hadoop distributed file system

    [ http://issues.apache.org/jira/browse/LUCENE-532?page=comments#action_12448770 ]
           
Michael McCandless commented on LUCENE-532:
-------------------------------------------

I think this is the same issue as LUCENE-532 (I just marked that one as a dup).

But there was one difference: does HDFS allow writing to the same file (eg "segments") more than once?  I thought it did not because it's "write once"?  Do we need to not do that (write to the same file more than once) to work with HDFS (lock-less gets us closer)?


[jira] Commented: (LUCENE-532) [PATCH] Indexing on Hadoop distributed file system

    [ http://issues.apache.org/jira/browse/LUCENE-532?page=comments#action_12448774 ]
           
Michael McCandless commented on LUCENE-532:
-------------------------------------------

Sorry, I meant "dup of LUCENE-704" above.


[jira] Commented: (LUCENE-532) [PATCH] Indexing on Hadoop distributed file system

    [ http://issues.apache.org/jira/browse/LUCENE-532?page=comments#action_12448780 ]
           
Michael McCandless commented on LUCENE-532:
-------------------------------------------

Also: I like the idea of never doing "seek" when writing.  The less functionality we rely on from the filesystem, the more portable Lucene will be.  Since Lucene is so wonderfully simple, never using "seek" during write is in fact very feasible.

I think to do this we need to change the CFS file format, so that the offsets are stored at the end of the file.  We actually can't pre-compute where the offsets will be because we can't make assumptions about how the file position changes when bytes are written: this is implementation specific.  For example, if the Directory implementation does on-the-fly compression, then the file position will not be the number of bytes written.  So I think we have to write at the end of the file.

Any opinions or other suggestions?


[jira] Commented: (LUCENE-532) [PATCH] Indexing on Hadoop distributed file system

    [ http://issues.apache.org/jira/browse/LUCENE-532?page=comments#action_12448877 ]
           
Andrzej Bialecki  commented on LUCENE-532:
------------------------------------------

Hadoop cannot (yet) change file position when writing. All files are write-once, i.e. once they are closed they are pretty much immutable. They are also append-only - writing uses a subclass of OutputStream.


[jira] Commented: (LUCENE-532) [PATCH] Indexing on Hadoop distributed file system

    [ http://issues.apache.org/jira/browse/LUCENE-532?page=comments#action_12448989 ]
           
Michael McCandless commented on LUCENE-532:
-------------------------------------------

Alas, in trying to change the CFS format so that file offsets are stored at the end of the file, when implementing the corresponding changes to CompoundFileReader, I discovered that this approach isn't viable.  I had been thinking the reader would look at the file length, subtract numEntries*sizeof(long), seek there, and then read the offsets (longs).  The problem is that we can't know sizeof(long), since this is dependent on the actual storage implementation, for the same reasoning as above.  I.e., we can't always assume that one written byte equals one file position.

So, then, the only solution I can think of (to avoid seek during write) would be to write to a separate file, for each *.cfs file, that contains the file offsets corresponding to the cfs file.  Eg, if we have _1.cfs we would also have _1.cfsx which holds the file offsets.   This is sort of costly if we care about # files (it doubles the number of files in the simple case of a bunch of segments w/ no deletes/separate norms).

Yonik had actually mentioned in LUCENE-704 that fixing CFS writing to not use seek was not very important, ie, it would be OK to not use compound files with HDFS as the store.

Does anyone see a better approach?
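
As a rough sketch of that companion-file idea (all names here, including the .cfsx layout, are assumptions for illustration, not an actual Lucene format):

import java.io.IOException;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.IndexInput;
import org.apache.lucene.store.IndexOutput;

// _1.cfs holds the concatenated sub-files; _1.cfsx holds one (name, offset)
// entry per sub-file.  Offsets are taken from getFilePointer() and written
// only once they are known, so neither file ever needs IndexOutput.seek().
class CompoundWithSidecar {
  static void write(Directory dir, String segment, String[] files) throws IOException {
    IndexOutput data = dir.createOutput(segment + ".cfs");
    IndexOutput sidecar = dir.createOutput(segment + ".cfsx");
    try {
      sidecar.writeVInt(files.length);
      byte[] buf = new byte[4096];
      for (int i = 0; i < files.length; i++) {
        sidecar.writeString(files[i]);
        sidecar.writeLong(data.getFilePointer());  // where this entry starts
        IndexInput in = dir.openInput(files[i]);
        try {
          long left = in.length();
          while (left > 0) {
            int chunk = (int) Math.min(buf.length, left);
            in.readBytes(buf, 0, chunk);
            data.writeBytes(buf, chunk);
            left -= chunk;
          }
        } finally {
          in.close();
        }
      }
    } finally {
      data.close();
      sidecar.close();
    }
  }
}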


RE: [jira] Commented: (LUCENE-532) [PATCH] Indexing on Hadoop distributed file system

Kevin Oliver
We considered patching this code when we ran into a data consistency bug with our file system. It wasn't too difficult to patch CompoundFileWriter to output lengths instead of offsets.

Naturally, I can't seem to find my implementation of this, but as I recall it wasn't too difficult to do. I think it also required adding a format version number at the beginning of the cfs file so that CompoundFileReader knew what the data format was.

Anyway, I just want to point out that there was an easy enough workaround for this that didn't require creating a new set of files.
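
For reference, a minimal reader-side sketch of that kind of layout (the format constant and structure are assumptions, not Kevin's actual patch); note that rebuilding offsets from lengths relies on exactly the file-position arithmetic questioned in the next comment:

import java.io.IOException;
import org.apache.lucene.store.IndexInput;

// The writer stored a format version followed by per-entry (name, length);
// the reader reconstructs each entry's offset by summing lengths, assuming
// that writing N bytes always advances the file position by exactly N.
class LengthBasedCfsHeader {
  static final int FORMAT_LENGTHS = -1;    // assumed version marker

  static long[] readOffsets(IndexInput in) throws IOException {
    int format = in.readInt();
    if (format != FORMAT_LENGTHS) {
      throw new IOException("unrecognized compound file format: " + format);
    }
    int count = in.readVInt();
    String[] names = new String[count];
    long[] lengths = new long[count];
    for (int i = 0; i < count; i++) {
      names[i] = in.readString();
      lengths[i] = in.readLong();
    }
    long[] offsets = new long[count];
    long offset = in.getFilePointer();     // data begins right after the header
    for (int i = 0; i < count; i++) {
      offsets[i] = offset;
      offset += lengths[i];                // arithmetic on file positions
    }
    return offsets;
  }
}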


[jira] Updated: (LUCENE-532) [PATCH] Indexing on Hadoop distributed file system

     [ http://issues.apache.org/jira/browse/LUCENE-532?page=all ]

Kevin Oliver updated LUCENE-532:
--------------------------------

    Attachment: cfs-patch.txt

Here are some diffs showing how to remove seeks from CompoundFileWriter (this is against an older version of Lucene, 1.4.2 I think, but the general idea is the same). There's also a test.


[jira] Commented: (LUCENE-532) [PATCH] Indexing on Hadoop distributed file system

Michael Gibney (Jira)
In reply to this post by Michael Gibney (Jira)
    [ http://issues.apache.org/jira/browse/LUCENE-532?page=comments#action_12449446 ]
           
Michael McCandless commented on LUCENE-532:
-------------------------------------------

Thank you for the patch & unit test!

This is actually the same approach that I started with.  But I ruled it
out because I don't think it's safe to do arithmetic on file positions
(i.e., adding lengths to compute positions).

For example, one can imagine a Directory implementation that does some
kind of compression, where writing N bytes does not in fact advance the
file position by N bytes.  Or an implementation that must escape certain
bytes, or that writes XML or uses some other alternate encoding.  I
don't know whether such Directory implementations exist today, but I
don't want to break them if they do, nor preclude them in the future.

And so the only value you should ever pass to seek() is a value you
previously obtained by calling getFilePointer().  The current javadocs
for these methods seem to imply this.
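
As a toy illustration of the kind of implementation being imagined - not
a real Lucene Directory, just enough to show why length arithmetic and
getFilePointer() can disagree - imagine an output that escapes one marker
byte by doubling it:

    import java.io.ByteArrayOutputStream;

    // Hypothetical escaping output: after N logical writeByte() calls the
    // physical position may have advanced by more than N bytes, so
    // arithmetic over logical lengths computes the wrong seek target,
    // while positions reported by getFilePointer() remain valid.
    class EscapingOutput {
      private final ByteArrayOutputStream out = new ByteArrayOutputStream();

      void writeByte(byte b) {
        if ((b & 0xFF) == 0xFF) {
          out.write(0xFF);          // escape marker doubles the byte
        }
        out.write(b & 0xFF);
      }

      // The only positions that are safe to seek back to later.
      long getFilePointer() {
        return out.size();          // physical position, not logical count
      }
    }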

However, on looking into this question further ... I do see that there
are places where Lucene already does arithmetic on file positions.  For
example, in accessing a *.fdx or *.tvx file we assume we can find a
given entry at file position FORMAT_SIZE + 8 * index.
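
In code, that existing arithmetic looks roughly like this (FORMAT_SIZE
and the class name are placeholders for whatever fixed header the
particular index file uses, not a specific Lucene class):

    import java.io.IOException;

    import org.apache.lucene.store.IndexInput;

    // An index file holding one 8-byte pointer per document after a small
    // fixed-size header: the entry for docNum is located by arithmetic on
    // the file position rather than by a previously recorded pointer.
    class PointerIndexLookup {
      static final long FORMAT_SIZE = 4;   // illustrative header size

      static long pointerFor(IndexInput indexStream, int docNum) throws IOException {
        indexStream.seek(FORMAT_SIZE + 8L * docNum);
        return indexStream.readLong();
      }
    }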

Maybe it is OK to make the definition of IndexOutput.seek() stricter, by
requiring that the position passed to seek() always equals "the number
of bytes written", thereby allowing us to do arithmetic based on
bytes/lengths and call seek() with such values?  I'm nervous about
making this API change.

Lucene currently makes this assumption, albeit in a fairly contained way
I think (most other calls to seek() pass values previously obtained from
getFilePointer()).

I think this is the open question.  Does anyone have input that would
help answer it?



[jira] Commented: (LUCENE-532) [PATCH] Indexing on Hadoop distributed file system

Michael Gibney (Jira)
In reply to this post by Michael Gibney (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12558412#action_12558412 ]

Grant Ingersoll commented on LUCENE-532:
----------------------------------------

Does anyone have a follow-up on this?  It seems like Hadoop-based indexing would be a nice feature.  It sounds like there was a lot of support for this, but it was never committed.  Is this still an issue?
