Term Freq Vector with SOLR cell?

Geoffrey Willis
I am using Solr in a web app to extract text from .pdf and .docx files. I was wondering whether I can access the term frequency and term position vectors via the HTTP interface exposed by Solr Cell. I'm posting and getting documents fine, and I've enabled term vectors, positions, offsets and payloads in the managed schema:

<field name="doc_content" type="text_ws" indexed="true" termOffsets="true" stored="true" termPayloads="true" termPositions="true" termVectors="true"/>

And I use a GET request similar to:

   http://localhost:8983/solr/myCore/tvrh?q=doc_content&tv=true&tv.tf=true&tv.df=true&tv.positions=true&tv.offsets=true&tv.payloads=true&tv.fl=includes

When I look in the browser network tab, I see that the query went in as expected with tv=true, tv.positions=true, etc., but there are no term positions or offsets in the results. I've done something similar using the Data Import Handler with Java, but I'm looking for a web solution. Before I roll my own term vector, I thought I'd see if it's available from Solr Cell.

Re: Term Freq Vector with SOLR cell?

Erik Hatcher-4
q=doc_content? Try q=id:"<some known id that you've indexed>"

Solr Cell and DIH are comparable (in that they are about getting content into Solr) but "unrelated" to TVRH. TVRH is about inspecting indexed content, regardless of how it got in.
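
A minimal sketch of what such a request could look like when built programmatically, assuming the unique key field is named id and the /tvrh handler is enabled for the core. The document id "doc-1" and the helper name build_tvrh_url are hypothetical. Note that tv.fl names the field(s) whose term vectors are returned, so this sketch points it at doc_content:

```python
from urllib.parse import urlencode


def build_tvrh_url(base_url, core, doc_id, field):
    """Build a term vector request URL for a single known document.

    Assumes the unique key field is called "id" and that the core
    exposes a /tvrh request handler.
    """
    params = {
        "q": f'id:"{doc_id}"',   # select one indexed document by its id
        "fl": "id",              # keep the regular result list small
        "tv": "true",            # turn on the term vector component
        "tv.fl": field,          # field(s) to return term vectors for
        "tv.tf": "true",
        "tv.df": "true",
        "tv.positions": "true",
        "tv.offsets": "true",
        "tv.payloads": "true",
        "wt": "json",
    }
    return f"{base_url}/{core}/tvrh?{urlencode(params)}"


# "doc-1" is a made-up id; substitute one that exists in the index.
url = build_tvrh_url("http://localhost:8983/solr", "myCore", "doc-1", "doc_content")
print(url)
```

The printed URL can be pasted into a browser or fetched from the web app. If the response still lacks positions and offsets, the field named in tv.fl is a reasonable first thing to check.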

        Erik


> On May 1, 2019, at 3:14 PM, Geoffrey Willis <[hidden email]> wrote:
>
> I am using Solr in a web app to extract text from .pdf, and docx files. I was wondering if I can access the TermFreq and TermPosition vectors via the HTTP interface exposed by Solr Cell. I’m posting/getting documents fine, I’ve enabled the TV, TFV etc in the managed schema:
>
> <field name="doc_content" type="text_ws" indexed="true" termOffsets="true" stored="true" termPayloads="true" termPositions="true" termVectors="true"/>
>
> And use a get request similar to :
>
>   http://localhost:8983/solr/myCore/tvrh?q=doc_content&tv=true&tv.tf=true&tv.df=true&tv.positions=true&tv.offsets=true&tv.payloads=true&tv.fl=includes
>
> When I look in the browser network tab, I see that the query went in as expected with tv=true, tv.positions= true etc. But no Term Positions/Offsets in the results. I’ve done similar using the Data Import Handler with java, but looking for a web solution. Before I “Roll my own” Term Vector, thought I’d see if it’s available from Solr Cell.


Re: Term Freq Vector with SOLR cell?

Geoffrey Willis
Thanks for the response. I got the tvrh handler from a Google search, and doc_content was meant as a filter. The actual query I'm using is:

A screen grab of the response headers:



So it appears that term vectors and term positions are set in the request, but not included in the response. My doc_content field was modified as shown earlier to enable storing these attributes when indexing. I get the doc_content data (text extracted by Tika), just not the term vector data I need, such as offsets and positions. Thanks for any help.
Geoff
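
One thing that may be worth ruling out, since the field was modified after the fact: term vector data is written at index time, so documents posted before the schema change would not carry it until they are indexed again. As a rough sketch, assuming the core uses the managed schema and the Schema API is reachable at /solr/myCore/schema, the field change could be expressed as a replace-field command like this (the helper name is hypothetical, and the request is only built here, not sent):

```python
import json
from urllib.request import Request


def build_replace_field_request(base_url, core):
    """Build (but do not send) a Schema API request that redefines
    doc_content with term vectors, positions, offsets and payloads on.
    """
    payload = {
        "replace-field": {
            "name": "doc_content",
            "type": "text_ws",
            "indexed": True,
            "stored": True,
            "termVectors": True,
            "termPositions": True,
            "termOffsets": True,
            "termPayloads": True,
        }
    }
    return Request(
        f"{base_url}/{core}/schema",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_replace_field_request("http://localhost:8983/solr", "myCore")
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

Sending it would be a matter of passing req to urllib.request.urlopen. After a change like this, the .pdf and .docx files would need to be posted to Solr Cell again before /tvrh can return positions and offsets for them.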



> On May 1, 2019, at 4:52 PM, Erik Hatcher <[hidden email]> wrote:
>
> q=doc_content? Try q=id:"<some known id that you've indexed>"
>
> Solr Cell and DIH are comparable (in that they are about getting content into Solr) but "unrelated" to TVRH. TVRH is about inspecting indexed content, regardless of how it got in.
>
> Erik

