[jira] [Created] (LUCENE-4299) No way to find term vectors options at read time

JIRA jira@apache.org
Robert Muir created LUCENE-4299:
-----------------------------------

             Summary: No way to find term vectors options at read time
                 Key: LUCENE-4299
                 URL: https://issues.apache.org/jira/browse/LUCENE-4299
             Project: Lucene - Core
          Issue Type: Bug
            Reporter: Robert Muir


The problem is simple:
# Term vectors can be configured "per-field-per-document": for the "body" field, document 0 can have them, document 1 might have none at all, document 2 might have offsets but no positions, and so on. To me this is not a useful feature at all: no one has ever mentioned a single use case for it, and it just makes our code more complicated. But it is what it is (for this issue).
# There is no way to discover these options for a field of a document. You have to do things like 'peek ahead' to see whether the first position of the first term is -1, or the same for offsets (except worse: we used to allow anything in offsets, so -1 might be an actual value). This makes the merging code really hairy and is tough on end consumers.

So I propose that instead of returning Terms for vectors, we return VectorTerms (extends Terms), which just adds hasOffsets() and hasPositions(). For example, lucene40 already knows this from the bits for the field/doc pair and just returns what it knows.
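
The proposal above can be sketched roughly as follows. This is a minimal illustration, not the attached patch: the classes are self-contained stand-ins rather than the real Lucene types, and the flag constants, their bit values, and the FlagBackedVectorTerms name are assumptions made up for the example.

```java
// Hypothetical stand-ins for illustration only; not the real Lucene classes.
abstract class Terms {
    /** Number of terms in this field. */
    abstract long size();
}

/** The proposed subclass: same as Terms, plus the two per-document options. */
abstract class VectorTerms extends Terms {
    abstract boolean hasOffsets();
    abstract boolean hasPositions();
}

/** A reader that already holds per-field/per-document flag bits just reports them. */
final class FlagBackedVectorTerms extends VectorTerms {
    // Assumed bit layout, chosen for this sketch.
    static final int STORE_POSITIONS = 0x1;
    static final int STORE_OFFSETS = 0x2;

    private final int flags;
    private final long termCount;

    FlagBackedVectorTerms(int flags, long termCount) {
        this.flags = flags;
        this.termCount = termCount;
    }

    @Override long size() { return termCount; }
    @Override boolean hasOffsets() { return (flags & STORE_OFFSETS) != 0; }
    @Override boolean hasPositions() { return (flags & STORE_POSITIONS) != 0; }
}

class VectorTermsSketch {
    public static void main(String[] args) {
        // Like "document 2" in the description: offsets but no positions.
        VectorTerms doc2Body =
            new FlagBackedVectorTerms(FlagBackedVectorTerms.STORE_OFFSETS, 3);
        System.out.println("offsets=" + doc2Body.hasOffsets()
            + " positions=" + doc2Body.hasPositions());
    }
}
```

With something of this shape, a consumer asks the returned object directly and never needs to inspect the first term to guess what was stored.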


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

[jira] [Updated] (LUCENE-4299) No way to find term vectors options at read time

JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/LUCENE-4299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-4299:
--------------------------------

    Attachment: LUCENE-4299.patch

Here's a prototype patch; all tests pass.

If we are OK with the idea, I can clean up the rest.
               


[jira] [Commented] (LUCENE-4299) No way to find term vectors options at read time

JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-4299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13432115#comment-13432115 ]

Robert Muir commented on LUCENE-4299:
-------------------------------------

I would also clean up the merging and CheckIndex code. That's the worst part, and it would become a lot simpler with this change.
               


[jira] [Commented] (LUCENE-4299) No way to find term vectors options at read time

JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-4299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13432122#comment-13432122 ]

Robert Muir commented on LUCENE-4299:
-------------------------------------

An alternative is to add this information to Terms itself, but then for postings it's redundant with FieldInfos, so I don't know if that's any better.
               


[jira] [Updated] (LUCENE-4299) No way to find term vectors options at read time

JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/LUCENE-4299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-4299:
--------------------------------

    Attachment: LUCENE-4299.patch

OK, the second idea seems simpler: just add these methods to Terms. Here's a patch.

I haven't improved term vector merging or CheckIndex yet.
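
A rough sketch of what putting the two methods on the base type might look like, including the redundancy with field-level metadata mentioned earlier for postings. All names here (TermsWithFlags, SketchIndexOptions, and the two implementations) are hypothetical and self-contained; they are not taken from the patch or from Lucene.

```java
// Hypothetical types for illustration; names are not taken from Lucene.
enum SketchIndexOptions {
    DOCS, FREQS, POSITIONS, OFFSETS // each level includes the ones before it
}

/** Base type carrying the two questions for both postings and term vectors. */
abstract class TermsWithFlags {
    abstract boolean hasOffsets();
    abstract boolean hasPositions();
}

/** Postings: the answer is field-wide, so it repeats what field metadata already says. */
final class PostingsTermsSketch extends TermsWithFlags {
    private final SketchIndexOptions fieldOptions;

    PostingsTermsSketch(SketchIndexOptions fieldOptions) {
        this.fieldOptions = fieldOptions;
    }

    @Override boolean hasOffsets() {
        return fieldOptions.compareTo(SketchIndexOptions.OFFSETS) >= 0;
    }

    @Override boolean hasPositions() {
        return fieldOptions.compareTo(SketchIndexOptions.POSITIONS) >= 0;
    }
}

/** Term vectors: the answer can differ from one document to the next. */
final class PerDocVectorTermsSketch extends TermsWithFlags {
    private final boolean offsets;
    private final boolean positions;

    PerDocVectorTermsSketch(boolean offsets, boolean positions) {
        this.offsets = offsets;
        this.positions = positions;
    }

    @Override boolean hasOffsets() { return offsets; }
    @Override boolean hasPositions() { return positions; }
}

final class TermsFlagsDemo {
    /** One code path serves both sources; no subtype check or cast is needed. */
    static String describe(TermsWithFlags terms) {
        return "positions=" + terms.hasPositions() + ", offsets=" + terms.hasOffsets();
    }

    public static void main(String[] args) {
        System.out.println(describe(new PostingsTermsSketch(SketchIndexOptions.POSITIONS)));
        System.out.println(describe(new PerDocVectorTermsSketch(true, false)));
    }
}
```

The trade-off in this shape is that callers get a single type to program against, at the cost of the postings implementation duplicating information that is already available per field.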
               


[jira] [Updated] (LUCENE-4299) No way to find term vectors options at read time

JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/LUCENE-4299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-4299:
--------------------------------

    Attachment: LUCENE-4299.patch

Updated patch: this really simplifies the default TermVectorsWriter.merge implementation.
               


[jira] [Updated] (LUCENE-4299) No way to find term vectors options at read time

JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/LUCENE-4299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-4299:
--------------------------------

    Attachment: LUCENE-4299.patch

Updated patch fixing a pretty big inefficiency in the highlighter. Its hasPositions(termvectors) check was inefficient before: it had to actually clone an IndexInput and read term bytes, freqs, positions, and offsets, just to see if the first position was -1.
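
To make the difference concrete, here is a toy comparison of the two ways of answering "does this vector have positions?". It is only a sketch: ToyVectorReader and its read counter are invented for the example, and the four counted steps are a loose stand-in for the real I/O described above, not a measurement of it.

```java
/** Toy model of a per-document term vector; invented for this example. */
final class ToyVectorReader {
    private final boolean positionsStored; // known up front from per-document flags
    private final int firstPosition;       // -1 stands for "no position stored"
    int simulatedReads = 0;                // counts the work done by peeking

    ToyVectorReader(boolean positionsStored) {
        this.positionsStored = positionsStored;
        this.firstPosition = positionsStored ? 0 : -1;
    }

    /** Old style: decode the start of the vector just to inspect one value. */
    boolean hasPositionsByPeeking() {
        simulatedReads++; // clone the input
        simulatedReads++; // read the first term's bytes
        simulatedReads++; // read its frequency
        simulatedReads++; // read its first position
        return firstPosition != -1;
    }

    /** New style: report what the reader already knows; nothing is decoded. */
    boolean hasPositions() {
        return positionsStored;
    }
}

final class PeekVersusFlagDemo {
    public static void main(String[] args) {
        ToyVectorReader reader = new ToyVectorReader(false);
        boolean viaFlag = reader.hasPositions();
        int readsAfterFlag = reader.simulatedReads;
        boolean viaPeek = reader.hasPositionsByPeeking();
        System.out.println("flag=" + viaFlag + " reads=" + readsAfterFlag);
        System.out.println("peek=" + viaPeek + " reads=" + reader.simulatedReads);
    }
}
```

Both calls give the same answer in this toy model, but only the peeking version touches the data. The sentinel approach also cannot work for offsets wherever -1 was a legal stored value.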

               


[jira] [Updated] (LUCENE-4299) No way to find term vectors options at read time

JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/LUCENE-4299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-4299:
--------------------------------

    Attachment: LUCENE-4299.patch

Added comparisons of these options in TestDuelingCodecs, and tried to simplify CheckIndex (only slightly), since we now know these values up front.

I think this is ready.
               


[jira] [Resolved] (LUCENE-4299) No way to find term vectors options at read time

JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/LUCENE-4299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir resolved LUCENE-4299.
---------------------------------

       Resolution: Fixed
    Fix Version/s: 4.0
                   5.0
   
