Offset Questions

Steve Suppe-2
Hi all,

I'm trying to index documents so that a) I have all the documents indexed
'normally' (in that I can search for documents that match certain words),
and b) parts of the document that I consider important, such as author and
title, are ALSO stored in their own indexed fields.

I have (a) working fine, and (b) is almost working - however, I'm trying to
force the separate field to have the original offsets of where it existed
in the text.  As in, if the title was at characters 76-200 in the original
text, I'd like the field to have that as its information, so when I look at
the field I can find the place in the document quickly.

I don't seem to be able to do this - I have my own analyzer that finds the
tokens and sets the start and end offsets accordingly.  However, when I
create the new field and write it to the index, it seems like these offsets
are ignored?  When I pull offsets out later, they start at 0 and move up
from there.

I am creating the field like:

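// Custom analyzer that finds the tokens and sets their start and end offsets.
CASAnnotationAnalyzer psa = new CASAnnotationAnalyzer();
// Register it for this field (addAnalyzer suggests a PerFieldAnalyzerWrapper).
analyzer.addAnalyzer(info.indexName, psa);

// Analyze the field value up front and index the resulting token stream,
// asking for term vectors with positions and offsets.
TokenStream ts = psa.tokenStream(info.indexName,
                                 new StringReader(info.value));
Field stemF = new Field(info.indexName, ts,
                        Field.TermVector.WITH_POSITIONS_OFFSETS);
d.add(stemF);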

(d is the document being indexed).

I have tried various permutations of creating the field and token stream -
does anyone have any insights, please?

Thanks in advance,
Steve

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Re: Offset Questions (Follow-Up)

Steve Suppe-2
OK, I think I understand what's going on - it looks like I am able to set
the token for the full author name (say, "Steve Suppe") with the correct
offsets, but the analyzer takes it one step further and tokenizes 'Steve'
and 'Suppe', which is giving me a lot more generated offsets and is
confusing me.

I like the tokenization, as it allows me to just search for Suppe and get
results.  However, I don't want those "sub-offsets" returned.  Is there a
way to distinguish the 'main' offsets for the whole field?

Thanks again,
Steve

Re: Offset Questions

Erick Erickson
In reply to this post by Steve Suppe-2
What is your analyzer doing? Let's assume you're trying
to index the title and that your entire text is

"this is a book and HERE IS THE TITLE."

I *think* your underlying analyzer should be returning
4 tokens with starts of 20 for HERE, 25 for IS,
28 for THE and 32 for TITLE, with appropriate ends.
Is that what's happening? And perhaps

If the value you're passing in to the analyzer is just the
title and not the entire text, what you report seems
perfectly reasonable to me....

But I haven't worked with this very much so take
this with the appropriate grain of salt...
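
A rough sketch of the point above, in plain Java with no Lucene classes: if the tokenizer only sees the title string, its offsets are relative to that string and start at 0, and they only line up with the original text once the position of the title in the full text is added. The class and method names (OffsetSketch, Span, tokenize) and the whitespace splitting are assumptions made for illustration, not anything from CASAnnotationAnalyzer.

```java
import java.util.ArrayList;
import java.util.List;

class OffsetSketch {
    /** One token plus its character span in the full document (end is exclusive). */
    static final class Span {
        final String text;
        final int start;
        final int end;

        Span(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return text + "[" + start + "," + end + ")";
        }
    }

    /**
     * Splits fieldValue on whitespace and shifts every token offset by
     * baseOffset, the position at which fieldValue begins in the full text.
     */
    static List<Span> tokenize(String fieldValue, int baseOffset) {
        List<Span> spans = new ArrayList<>();
        int n = fieldValue.length();
        int i = 0;
        while (i < n) {
            // Skip any whitespace before the next token.
            while (i < n && Character.isWhitespace(fieldValue.charAt(i))) i++;
            int start = i;
            // Consume the token itself.
            while (i < n && !Character.isWhitespace(fieldValue.charAt(i))) i++;
            if (i > start) {
                spans.add(new Span(fieldValue.substring(start, i),
                                   baseOffset + start, baseOffset + i));
            }
        }
        return spans;
    }

    public static void main(String[] args) {
        String fullText = "this is a book and HERE IS THE TITLE.";
        String title = "HERE IS THE TITLE";
        int base = fullText.indexOf(title);

        // Offsets relative to the title alone: these start at 0.
        System.out.println(tokenize(title, 0));
        // Offsets relative to the full text: shifted by the title's position.
        System.out.println(tokenize(title, base));
    }
}
```

Counting characters from zero, the shifted starts come out as 19, 24, 27 and 31, one lower than the figures quoted above, which look like they were counted from 1.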

Best
Erick


On Fri, Mar 7, 2008 at 1:38 PM, Steve Suppe <[hidden email]> wrote:

> I don't seem to be able to do this - I have my own analyzer that finds the
> tokens and sets the start and end offsets accordingly.  However, when I
> create the new field and write it to the index, it seems like these offsets
> are ignored?  When I pull offsets out later, they start at 0 and move up
> from there.

Re: Offset Questions (Follow-Up)

Erick Erickson
In reply to this post by Steve Suppe-2
Our mails are crossing....

Not that I know of. But why don't you just index (or maybe just store)
a separate field containing your offset information? Something like
title_offset with, say, a comma-separated pair denoting char position
and length that you then read in at search time and parse.....
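
As a minimal sketch of that idea, the value of such a title_offset field could be built and parsed along these lines. This is plain Java with no Lucene calls; the class and method names (TitleOffsetField, encode, decode) are made up for illustration, and the example treats the end of the 76-200 range from the original question as exclusive, which is an assumption.

```java
class TitleOffsetField {
    /** Builds the stored value: "start,length". */
    static String encode(int start, int length) {
        if (start < 0 || length < 0) {
            throw new IllegalArgumentException("start and length must be non-negative");
        }
        return start + "," + length;
    }

    /** Parses a stored "start,length" value back into {start, length}. */
    static int[] decode(String stored) {
        String[] parts = stored.split(",");
        if (parts.length != 2) {
            throw new IllegalArgumentException("expected start,length but got: " + stored);
        }
        return new int[] {
            Integer.parseInt(parts[0].trim()),
            Integer.parseInt(parts[1].trim())
        };
    }

    public static void main(String[] args) {
        // Title occupying characters 76 up to (not including) 200.
        String stored = encode(76, 200 - 76);
        int[] pair = decode(stored);
        System.out.println(stored + " -> start=" + pair[0] + ", length=" + pair[1]);
    }
}
```

The string would be written into the separate field at index time and run through decode after the document is fetched at search time, so the searchable title field can keep whatever tokenization it likes.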

But your tokenizer controls *everything*. Why isn't the Token
returned from your next() method constructed with the
offsets you desire?

Erick

Re: Offset Questions

Steve Suppe-2
In reply to this post by Erick Erickson
Hi Erick,

Thanks for the response.  I think I'm starting to get the hang of
this.  That's a really good insight, but I'm wondering how to handle it
when a document can have multiple instances of the same field: say, instead
of Author, the City names that are mentioned.  But, as you said, I control
everything, so I may be able to work this out...
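
One possible way to handle repeated instances, sketched in plain Java: keep one "start,length" value per mention, in the same order as the mentions themselves, so the i-th offset value belongs to the i-th field value. This assumes the mentions are known in document order and that stored values come back in the order they were added. The names (MentionOffsets, locate) and the sample sentence are invented for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class MentionOffsets {
    /**
     * Finds each mention in text, searching forward from the end of the
     * previous match, and returns one "start,length" string per mention.
     */
    static List<String> locate(String text, List<String> mentions) {
        List<String> encoded = new ArrayList<>();
        int cursor = 0;
        for (String mention : mentions) {
            int start = text.indexOf(mention, cursor);
            if (start < 0) {
                throw new IllegalArgumentException("mention not found: " + mention);
            }
            encoded.add(start + "," + mention.length());
            // Move past this match so a repeated name finds its next occurrence.
            cursor = start + mention.length();
        }
        return encoded;
    }

    public static void main(String[] args) {
        String text = "Flights from Boston to Denver, then back to Boston.";
        List<String> cities = Arrays.asList("Boston", "Denver", "Boston");
        // One offset value per city mention, in the same order as the mentions.
        System.out.println(locate(text, cities));
    }
}
```

Each returned string would go into its own offset field instance next to the matching City value, since a document can carry several fields with the same name.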

Still thinking :)  Thanks so much so far!

Steve
