[jira] [Created] (SOLR-3229) TermVectorComponent does not return terms in distributed search

JIRA jira@apache.org
TermVectorComponent does not return terms in distributed search
---------------------------------------------------------------

                 Key: SOLR-3229
                 URL: https://issues.apache.org/jira/browse/SOLR-3229
             Project: Solr
          Issue Type: Bug
          Components: SearchComponents - other
    Affects Versions: 4.0
         Environment: Ubuntu 11.10, openjdk-6
            Reporter: Hang Xie
             Fix For: 4.0


TermVectorComponent does not return terms in distributed search. distributedProcess() incorrectly uses the Solr uniqueKey to issue subrequests, while process() expects Lucene document ids. In addition, the parameters are passed in a different format, so distributed search returns no results.
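
To illustrate the id mismatch described above, here is a minimal sketch in plain Java. It does not use Solr or Lucene classes: `IdMismatchDemo`, `Shard`, and their methods are hypothetical stand-ins. The assumption is that each shard numbers its documents internally starting at 0, while the uniqueKey is the only identifier that means the same thing on every shard.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class IdMismatchDemo {
    // Hypothetical stand-in for one shard's index; not a Solr or Lucene class.
    static class Shard {
        private final List<String> keysByDocId = new ArrayList<>();      // internal docid -> uniqueKey
        private final Map<String, Integer> docIdByKey = new HashMap<>(); // uniqueKey -> internal docid

        // Adds a document and returns the shard-local docid it was assigned.
        int add(String uniqueKey) {
            int docId = keysByDocId.size();
            keysByDocId.add(uniqueKey);
            docIdByKey.put(uniqueKey, docId);
            return docId;
        }

        // Treats the argument as an internal docid, as a local-only handler would.
        String lookupByDocId(int docId) {
            return docId >= 0 && docId < keysByDocId.size() ? keysByDocId.get(docId) : null;
        }

        // Resolves the uniqueKey to the shard-local docid first, then looks it up.
        String lookupByUniqueKey(String uniqueKey) {
            Integer docId = docIdByKey.get(uniqueKey);
            return docId == null ? null : lookupByDocId(docId);
        }
    }

    public static void main(String[] args) {
        Shard shard = new Shard();
        shard.add("1001");
        shard.add("1002");
        // A uniqueKey passed where a docid is expected matches nothing.
        System.out.println(shard.lookupByDocId(Integer.parseInt("1002")));
        // Resolving the uniqueKey to a docid first finds the document.
        System.out.println(shard.lookupByUniqueKey("1002"));
    }
}
```

In this sketch the first lookup misses because 1002 is not a valid internal docid on the shard, which mirrors the "returns no results" symptom in the report.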

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

[jira] [Updated] (SOLR-3229) TermVectorComponent does not return terms in distributed search

JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/SOLR-3229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hang Xie updated SOLR-3229:
---------------------------

    Attachment: TermVectorComponent.patch

patch to TermVectorComponent.java
               

> TermVectorComponent does not return terms in distributed search
> ---------------------------------------------------------------
>
>                 Key: SOLR-3229
>                 URL: https://issues.apache.org/jira/browse/SOLR-3229
>             Project: Solr
>          Issue Type: Bug
>          Components: SearchComponents - other
>    Affects Versions: 4.0
>         Environment: Ubuntu 11.10, openjdk-6
>            Reporter: Hang Xie
>              Labels: patch
>             Fix For: 4.0
>
>         Attachments: TermVectorComponent.patch
>
>
> TermVectorComponent does not return terms in distributed search. distributedProcess() incorrectly uses the Solr uniqueKey to issue subrequests, while process() expects Lucene document ids. In addition, the parameters are passed in a different format, so distributed search returns no results.

[jira] [Commented] (SOLR-3229) TermVectorComponent does not return terms in distributed search

JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/SOLR-3229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13226677#comment-13226677 ]

Hang Xie commented on SOLR-3229:
--------------------------------

Patch attached, tested in a 4.0 environment (both distributed and non-distributed). It should work with 3.x, but I didn't test that.

Everything is compatible with the previous behavior except the name of each lst, which used to be "doc-<doc id>". I changed it to the Solr uniqueKey, because the former may not be unique in a multi-shard environment.
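
A minimal sketch of why "doc-<doc id>" keys can clash across shards, in plain Java with no Solr classes. `KeyCollisionDemo`, `merge`, and the sample values are hypothetical; the assumption is that both shards have a document with internal docid 0.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class KeyCollisionDemo {
    // Merges per-shard entries into one map; a repeated key overwrites the earlier entry.
    static Map<String, String> merge(Map<String, String> shardA, Map<String, String> shardB) {
        Map<String, String> merged = new LinkedHashMap<>(shardA);
        merged.putAll(shardB);
        return merged;
    }

    public static void main(String[] args) {
        // Keyed by "doc-<docid>": both shards have an internal docid 0.
        Map<String, String> aByDocId = Map.of("doc-0", "vectors for uniqueKey A7");
        Map<String, String> bByDocId = Map.of("doc-0", "vectors for uniqueKey B3");
        System.out.println(merge(aByDocId, bByDocId).size()); // one entry is lost

        // Keyed by uniqueKey: the same two documents stay distinct.
        Map<String, String> aByKey = Map.of("A7", "vectors for uniqueKey A7");
        Map<String, String> bByKey = Map.of("B3", "vectors for uniqueKey B3");
        System.out.println(merge(aByKey, bByKey).size());
    }
}
```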
               

[jira] [Updated] (SOLR-3229) TermVectorComponent does not return terms in distributed search

JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/SOLR-3229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hang Xie updated SOLR-3229:
---------------------------

    Attachment: TermVectorComponent.patch

Revised patch, no longer fails unit test.
               

> TermVectorComponent does not return terms in distributed search
> ---------------------------------------------------------------
>
>                 Key: SOLR-3229
>                 URL: https://issues.apache.org/jira/browse/SOLR-3229
>             Project: Solr
>          Issue Type: Bug
>          Components: SearchComponents - other
>    Affects Versions: 4.0
>         Environment: Ubuntu 11.10, openjdk-6
>            Reporter: Hang Xie
>              Labels: patch
>             Fix For: 4.0
>
>         Attachments: TermVectorComponent.patch, TermVectorComponent.patch
>
>
> TermVectorComponent does not return terms in distributed search. distributedProcess() incorrectly uses the Solr uniqueKey to issue subrequests, while process() expects Lucene document ids. In addition, the parameters are passed in a different format, so distributed search returns no results.

[jira] [Updated] (SOLR-3229) TermVectorComponent does not return terms in distributed search

JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/SOLR-3229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hang Xie updated SOLR-3229:
---------------------------

    Attachment: TermVectorComponent.patch

Revised: it seems distributedProcess() is unnecessary and can be removed. The only major change is adding finishStage() to merge the responses from the subrequests sent to the shards.
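
As a rough illustration of the merge step described above, the following plain-Java sketch combines per-shard term vector sections into one response. It is not the patch's code and does not use Solr's API: `MergeStageDemo`, `mergeShardResponses`, and the map-of-lists response shape are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class MergeStageDemo {
    // Combines per-shard sections (uniqueKey -> terms) into a single section.
    // A shard with nothing to contribute is represented by an empty map.
    static Map<String, List<String>> mergeShardResponses(List<Map<String, List<String>>> shardResponses) {
        Map<String, List<String>> merged = new LinkedHashMap<>();
        for (Map<String, List<String>> response : shardResponses) {
            for (Map.Entry<String, List<String>> entry : response.entrySet()) {
                // Copy the list so the merged response does not alias shard data.
                merged.put(entry.getKey(), new ArrayList<>(entry.getValue()));
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, List<String>> shard1 = Map.of("A7", List.of("solr", "lucene"));
        Map<String, List<String>> shard2 = Map.of("B3", List.of("vector"));
        Map<String, List<String>> empty = Map.of();
        System.out.println(mergeShardResponses(List.of(shard1, empty, shard2)));
    }
}
```

Keying the merged section by uniqueKey is what keeps entries from different shards apart in this sketch.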
               

[jira] [Updated] (SOLR-3229) TermVectorComponent does not return terms in distributed search

JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/SOLR-3229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hang Xie updated SOLR-3229:
---------------------------

    Attachment:     (was: TermVectorComponent.patch)
   

[jira] [Updated] (SOLR-3229) TermVectorComponent does not return terms in distributed search

JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/SOLR-3229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hang Xie updated SOLR-3229:
---------------------------

    Attachment:     (was: TermVectorComponent.patch)
   

[jira] [Assigned] (SOLR-3229) TermVectorComponent does not return terms in distributed search

JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/SOLR-3229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hoss Man reassigned SOLR-3229:
------------------------------

    Assignee: Hoss Man
   

> TermVectorComponent does not return terms in distributed search
> ---------------------------------------------------------------
>
>                 Key: SOLR-3229
>                 URL: https://issues.apache.org/jira/browse/SOLR-3229
>             Project: Solr
>          Issue Type: Bug
>          Components: SearchComponents - other
>    Affects Versions: 4.0-ALPHA
>         Environment: Ubuntu 11.10, openjdk-6
>            Reporter: Hang Xie
>            Assignee: Hoss Man
>              Labels: patch
>             Fix For: 4.0
>
>         Attachments: TermVectorComponent.patch
>
>
> TermVectorComponent does not return terms in distributed search. distributedProcess() incorrectly uses the Solr uniqueKey to issue subrequests, while process() expects Lucene document ids. In addition, the parameters are passed in a different format, so distributed search returns no results.

[jira] [Updated] (SOLR-3229) TermVectorComponent does not return terms in distributed search

JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/SOLR-3229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hoss Man updated SOLR-3229:
---------------------------

    Attachment: SOLR-3229.patch

Hang: Thank you for your patch.

I agree, the "docid" as a key is dangerous and misleading in distributed mode, and we should switch to using the uniqueKey when available, but if we leave things as you had it in your patch, existing (single node) users who don't have a uniqueKey field would no longer be able to get term vectors at all.

I updated your patch to leave the key alone if there is no uniqueKey, and eliminate the "doc-" prefix when there is one.  I also added a new distributed test to prove that everything is working, and that turned up a few problems - some of which i fixed (dealing with warnings, and ensuring that TVC results are in the correct order for the result documents).

One thing i discovered that i'm not sure about is what to do about the "df" and "tf-idf" values when requested. in the test they have to be ignored because the way the distributed test works is to create a single node instance and compare it with a multi-node instance that has identical documents, and in the distributed TVC code, these won't match up -- but i'm not sure if that's a bug (because the df & tf-idf values aren't "merged" from all nodes) or a feature (because you get the real df & tf-idf values for that term for that doc from the shard it lives in) ... either way it shouldn't stop fixing the basic problem of TVC failing painfully in a distributed request, so i've opened SOLR-3720 to track this in the future.
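
A small worked example of why shard-local and single-node values can differ. This is only a sketch under stated assumptions: it assumes tf-idf is computed as tf / df, that the shards hold disjoint documents (so the single-node df is the sum of the per-shard df values), and the numbers are made up. `ShardLocalStatsDemo` and `tfIdf` are hypothetical names.

```java
class ShardLocalStatsDemo {
    // Assumed formula for this sketch: tf-idf = tf / df.
    static double tfIdf(int tf, int df) {
        return (double) tf / df;
    }

    public static void main(String[] args) {
        int tf = 2;          // occurrences of the term in the document
        int dfOnShardA = 1;  // docs containing the term on the shard that holds the document
        int dfOnShardB = 3;  // docs containing the term on the other shard
        int dfGlobal = dfOnShardA + dfOnShardB; // what a single-node index would report

        System.out.println(tfIdf(tf, dfOnShardA)); // shard-local value: 2.0
        System.out.println(tfIdf(tf, dfGlobal));   // single-node value: 0.5
    }
}
```

With these numbers the same term in the same document gets 2.0 from its shard and 0.5 from a single-node index, which is the kind of mismatch the distributed test has to ignore.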

feedback on this revised patch/test would be appreciated
               

> TermVectorComponent does not return terms in distributed search
> ---------------------------------------------------------------
>
>                 Key: SOLR-3229
>                 URL: https://issues.apache.org/jira/browse/SOLR-3229
>             Project: Solr
>          Issue Type: Bug
>          Components: SearchComponents - other
>    Affects Versions: 4.0-ALPHA
>         Environment: Ubuntu 11.10, openjdk-6
>            Reporter: Hang Xie
>            Assignee: Hoss Man
>              Labels: patch
>             Fix For: 4.0
>
>         Attachments: SOLR-3229.patch, TermVectorComponent.patch
>
>
> TermVectorComponent does not return terms in distributed search. distributedProcess() incorrectly uses the Solr uniqueKey to issue subrequests, while process() expects Lucene document ids. In addition, the parameters are passed in a different format, so distributed search returns no results.

[jira] [Commented] (SOLR-3229) TermVectorComponent does not return terms in distributed search

JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/SOLR-3229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13430876#comment-13430876 ]

Hang Xie commented on SOLR-3229:
--------------------------------

I use "doc-0" to make it compatible with single node mode so far as I can recall, as my client's parser was expecting that. It's up to you whether to keep "doc-" or not; it seems to me that keeping it would avoid a lot of changes in the tests. Other than that, I have no comments on the test, given my little to no knowledge of Solr's test framework.

I remember reading that df/tf-idf in distributed mode is a highly anticipated feature. I don't expect it can be done easily, so I'm fine with having a separate bug for it.
               

[jira] [Commented] (SOLR-3229) TermVectorComponent does not return terms in distributed search

JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/SOLR-3229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13431199#comment-13431199 ]

Hoss Man commented on SOLR-3229:
--------------------------------

bq. I use "doc-0" to make it compatible with single node mode so far as I can recall ...

well, the biggest problem as i mentioned was that you were *only* including the vector information in the response if there was a uniqueKey value

the format might have looked consistent, and the resulting string values were consistent in the test -- but that was only a fluke of the fact that the uniqueKey values for the test docs were monotonically increasing integers starting at "0" -- so they just happened to correspond with the internal lucene docids.

i think changing the format to only use the "doc-" when there is no uniqueKeyField in the schema makes the most sense -- both because it helps make it clear when the output key is coming from the uniqueKey instead of the docid, and because moving forward that's the most logical thing for most users (who use a uniqueKey field)
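
The key-naming rule proposed above can be sketched as a single helper. This is an illustration only, not the committed code: `ResponseKeyDemo` and `responseKey` are hypothetical names, and a null `uniqueKeyValue` is used here to stand for a schema without a uniqueKey field.

```java
class ResponseKeyDemo {
    // Uses the uniqueKey value when one is available; otherwise falls back to
    // the "doc-<docid>" form based on the internal Lucene docid.
    static String responseKey(String uniqueKeyValue, int luceneDocId) {
        return uniqueKeyValue != null ? uniqueKeyValue : "doc-" + luceneDocId;
    }

    public static void main(String[] args) {
        System.out.println(responseKey("A7", 0)); // schema with a uniqueKey field
        System.out.println(responseKey(null, 5)); // schema without a uniqueKey field
    }
}
```

The "doc-" prefix then doubles as a signal to clients that the key is an internal docid rather than a uniqueKey value.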


               

[jira] [Resolved] (SOLR-3229) TermVectorComponent does not return terms in distributed search

JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/SOLR-3229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hoss Man resolved SOLR-3229.
----------------------------

       Resolution: Fixed
    Fix Version/s: 5.0

Committed revision 1370870. - trunk
Committed revision 1370871. - 4x


               

> TermVectorComponent does not return terms in distributed search
> ---------------------------------------------------------------
>
>                 Key: SOLR-3229
>                 URL: https://issues.apache.org/jira/browse/SOLR-3229
>             Project: Solr
>          Issue Type: Bug
>          Components: SearchComponents - other
>    Affects Versions: 4.0-ALPHA
>         Environment: Ubuntu 11.10, openjdk-6
>            Reporter: Hang Xie
>            Assignee: Hoss Man
>              Labels: patch
>             Fix For: 5.0, 4.0
>
>         Attachments: SOLR-3229.patch, TermVectorComponent.patch
>
>
> TermVectorComponent does not return terms in distributed search. distributedProcess() incorrectly uses the Solr uniqueKey to issue subrequests, while process() expects Lucene document ids. In addition, the parameters are passed in a different format, so distributed search returns no results.
