WordDelimiterGraphFilter swallows emojis

21 messages
WordDelimiterGraphFilter swallows emojis

Michael Sokolov-4
WDGF (and WordDelimiterFilter) treat emoji as "SUBWORD_DELIM" characters
like punctuation and thus remove them, but we would like to be able to
search for emoji and use this filter for handling dashes, dots and other
intra-word punctuation.

These filters identify non-word and non-digit characters by two mechanisms:
direct lookup in a character table, and fallback to Unicode class. The
character table can't easily be used to handle emoji since it would need to
be populated with the entire Unicode character set in order to reach
emoji-land. On the other hand, if we change the handling of emoji by class,
and say treat them as word-characters, this will also end up pulling in all
the other OTHER_SYMBOL characters as well. Maybe that's OK, but I think
some of these other symbols are more like punctuation (this class is a grab
bag of all kinds of beautiful dingbats like trademark, degree symbols, etc.;
see https://www.compart.com/en/unicode/category/So). On the other other hand,
how do we even identify emoji? I don't think the Java Character API is
adequate to the task. Perhaps we must incorporate a table.
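To make the grab-bag problem concrete, a quick check with the stock Java Character API (a standalone sketch, not part of any proposal here) shows an emoji and the trademark sign land in the same general category:

```java
public class CategoryDemo {
    public static void main(String[] args) {
        int grinningFace = 0x1F600; // GRINNING FACE emoji
        int trademark = 0x2122;     // TRADE MARK SIGN, not an emoji
        // Both report general category So (OTHER_SYMBOL), so treating So as
        // word characters would pull in trademark, degree signs, etc. too.
        System.out.println(Character.getType(grinningFace) == Character.OTHER_SYMBOL); // true
        System.out.println(Character.getType(trademark) == Character.OTHER_SYMBOL);    // true
    }
}
```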

Suppose we come up with a good way to classify emoji; then how should they
be treated in this class? Sometimes they may be embedded in tokens with
other characters: I see people using emoji and other symbols as part of
their names, and sometimes they stand alone (with whitespace separation). I
think one way forward here would be to treat these as a special class akin
to words and numbers, and provide similar options (SPLIT_ON_EMOJI,
CATENATE_EMOJI) as we have for those classes.

Or maybe as a convenience, we provide a way to get a table that encodes the
default classifications of all characters up to some given limit, and then
let the caller modify it? That would at least provide an easy way to treat
emoji as letters.

Any thoughts?

Re: WordDelimiterGraphFilter swallows emojis

julien Blaize
Hello Michael,

I had previously worked on emoji detection with Lucene.

I had to extend the Tokenizer class (and not the TokenFilter, as
WordDelimiterFilter does) to preserve the delimiter attribute.
I also had to keep track of consecutive delimiters in the character stream,
because Lucene's default implementation only keeps the last one.

It may put you on the right track to start by looking at the
Tokenizer instead of the TokenFilter.

By the way, I used the emoji list from this project to detect sequences of
characters:
https://github.com/jolicode/emoji-search/blob/master/synonyms/cldr-emoji-annotation-synonyms-fr.txt
I detect sequences of characters, and while the sequence is a possible emoji
I keep tracking; when I have a full emoji I put it in the CharTermAttribute,
so it's treated as a word and not a delimiter.
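His longest-match idea might be sketched like this (the two-entry set is a hypothetical stand-in for the CLDR-derived list he links; a real implementation would track prefixes incrementally rather than re-checking substrings):

```java
import java.util.Set;

final class EmojiScanner {
    // Hypothetical mini-list; in practice, load the CLDR-derived emoji data.
    // U+2764 HEAVY BLACK HEART, alone and with the variation selector U+FE0F.
    private static final Set<String> EMOJI = Set.of("\u2764", "\u2764\uFE0F");

    /** Returns the longest emoji sequence starting at offset, or "" if none. */
    static String longestEmojiAt(String text, int offset) {
        String best = "";
        for (int end = offset + 1; end <= text.length(); end++) {
            String candidate = text.substring(offset, end);
            // Keep tracking: a longer candidate may still be a full emoji.
            if (EMOJI.contains(candidate)) best = candidate;
        }
        return best;
    }
}
```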

Regards
--
Julien Blaize



Re: WordDelimiterGraphFilter swallows emojis

Robert Muir
In reply to this post by Michael Sokolov-4
On Tue, Jul 3, 2018 at 8:00 AM, Michael Sokolov <[hidden email]> wrote:

> [...] On the other other hand,
> how do we even identify emoji? I don't think the Java Character API is
> adequate to the task. Perhaps we must incorporate a table.

There are several Unicode properties for identifying emoji (see e.g. the
Unicode segmentation algorithms, and the tagging function in ICUTokenizer),
but it's not based on general category. Additionally, an emoji may not be a
single character but a sequence, so it's more involved than what
WordDelimiterFilter is really ready for. I also don't think we should
start storing/maintaining Unicode property tables ourselves; if we
want to fix WordDelimiterFilter, it should just depend on ICU instead.

> [...] Or maybe as a convenience, we provide a way to get a table that encodes the
> default classifications of all characters up to some given limit, and then
> let the caller modify it?

There is already a way to provide a table to this thing. But one
bigger issue is that WordDelimiterFilter doesn't operate on Unicode
codepoints, so I don't think you're going to be able to do what you
want, since most emoji are not in the BMP. WordDelimiterFilter is
really only suitable for categorizing characters in the BMP; it just
doesn't split surrogates.
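That char-versus-codepoint mismatch is easy to see with plain Java (a standalone sketch, not Lucene code):

```java
public class SurrogateDemo {
    public static void main(String[] args) {
        // U+1F600 GRINNING FACE, stored in a Java String as a surrogate pair.
        String emoji = "\uD83D\uDE00";
        // One codepoint, but two Java chars: a filter that classifies
        // char-by-char sees two surrogate halves, never one emoji.
        System.out.println(emoji.length());                             // 2
        System.out.println(emoji.codePointCount(0, emoji.length()));    // 1
        System.out.println(Character.isHighSurrogate(emoji.charAt(0))); // true
    }
}
```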

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


Re: WordDelimiterGraphFilter swallows emojis

Robert Muir
In reply to this post by Michael Sokolov-4
> Any thoughts?

The best idea I have would be to tokenize with ICUTokenizer, which will
tag emoji sequences with the "<EMOJI>" token type, then use
ConditionalTokenFilter to send all tokens EXCEPT those with a token type
of "<EMOJI>" to your WordDelimiterFilter. This way
WordDelimiterFilter never sees the emoji at all and can't screw them
up.
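That routing might look roughly like this (a sketch only, assuming Lucene 7.4+, where ConditionalTokenFilter landed, plus the analysis-icu module; the class and method names should be double-checked against the release):

```java
// Sketch: apply the wrapped filter chain only to non-emoji tokens.
import java.util.function.Function;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.miscellaneous.ConditionalTokenFilter;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

final class SkipEmojiFilter extends ConditionalTokenFilter {
    private final TypeAttribute typeAtt = addAttribute(TypeAttribute.class);

    SkipEmojiFilter(TokenStream input, Function<TokenStream, TokenStream> wrapped) {
        super(input, wrapped);
    }

    @Override
    protected boolean shouldFilter() {
        // true = send this token through the wrapped filter;
        // "<EMOJI>"-typed tokens bypass it untouched.
        return !"<EMOJI>".equals(typeAtt.type());
    }
}
```

Hypothetical wiring: `new SkipEmojiFilter(icuTokenizerStream, in -> new WordDelimiterGraphFilter(in, flags, null))`, where `flags` is the usual WDGF configuration.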



Re: WordDelimiterGraphFilter swallows emojis

Michael Sokolov-4
In reply to this post by julien Blaize
Thanks for the pointer.


Re: WordDelimiterGraphFilter swallows emojis

Michael Sokolov-4
In reply to this post by Robert Muir
Yes, that sounds good -- this ConditionalTokenFilter is going to be very
helpful. We have overridden the ICUTokenizer's RBBI rules, but I'll poke
around and see about incorporating the emoji rules from there. Thanks,
Robert


Re: WordDelimiterGraphFilter swallows emojis

Robert Muir
If you customized the rules, maybe have a look at
https://issues.apache.org/jira/browse/LUCENE-8366

The rules got simpler and we also updated the customization example
used for the factory's test.




Re: WordDelimiterGraphFilter swallows emojis

Michael Sokolov-4
Ah, I see -- there is \p{Emoji} to start with, which is nice, but also this
Extended_Pictographic property -- I'll read more and get back if I have
questions. It might be a little while before I dig into this though. Thanks again.


Size of Document

Chris Bamford
In reply to this post by Robert Muir
Hi there,

How can I calculate the total size of a Lucene Document that I'm about
to write to an index so I know how many bytes I am writing please?  I
need it for some external metrics collection.

Thanks

- Chris



Re: Size of Document

Adrien Grand
Hello,

There is no way to compute the byte size of a document. Also note that the
relationship between the size of a document and how much space it will use
in the Lucene index is quite complex.


Re: Size of Document

Chris Bamford
Hello Adrien,


>
> There is no way to compute the byte size of a document.

I feared that!

> Also note that the
> relationship between the size of a document and how much space it will use
> in the Lucene index is quite complex.
>
I understand. I was wondering if there was maybe some sneaky way of peeking inside the IndexWriter before and after a write to compare buffer sizes?

Thanks
Chris





Re: Size of Document

Adrien Grand
IndexWriter.ramBytesUsed() gives you access to the current memory usage of
IndexWriter's buffers, but it can't tell you by how much that usage increased
for a given document if there is concurrent access to the IndexWriter.
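Under a single-writer assumption, the delta can be sketched like this (a hypothetical helper, not from the thread; as noted, the number is not meaningful with concurrent writers, and it reflects in-RAM buffering rather than final on-disk size):

```java
// Sketch: assumes a Lucene version where IndexWriter implements
// Accountable (ramBytesUsed()) and no other thread is writing.
import java.io.IOException;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;

final class DocRamDelta {
    /** Rough in-RAM growth caused by indexing one document. */
    static long addAndMeasure(IndexWriter writer, Document doc) throws IOException {
        long before = writer.ramBytesUsed();
        writer.addDocument(doc);
        return writer.ramBytesUsed() - before;
    }
}
```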


Re: Size of Document

Chris Bamford
> IndexWriter.ramBytesUsed() gives you access to the current memory usage of
> IndexWriter's buffers, but it can't tell you by how much it increased for a
> given document assuming concurrent access to the IndexWriter.
>
Thanks, although I can’t find that API. Is there an equivalent call for Lucene 4.10.3?





Re: Size of Document

Adrien Grand
It was called IndexWriter.ramSizeInBytes() in 4.10.3.


Re: Size of Document

Terry Steichen
In reply to this post by Chris Bamford
In the document types I usually index (.pdf, .docx/.doc, .eml), there
exists a metadata field called "stream_size" that contains the size of
the document on disk.  You don't have to compute it.  Thus, when you
retrieve each document you can pull out the contents of this field and,
if you like, include it in each hitlist entry.






Re: Size of Document

Erick Erickson
But does size on disk help? If the doc has a zillion
images in it, those aren't part of the resulting index
(I'm excluding stored data here)....




Re: Size of Document

Chris Bamford
Hi Erick

Yes, size on disk is what I’m after as it will feed into an eventual calculation regarding actual bytes written (not interested in the source data document size, just real disk usage).
Thanks

Chris





Re: Size of Document

Erick Erickson
I think we're not talking about the same thing.

You asked "How can I calculate the total size of a Lucene Document"...

I was responding to Terry's comment: "In the document types I
usually index (.pdf, .docx/.doc, .eml), there exists a metadata field
called 'stream_size' that contains the size of the document on disk."

Two totally different beasts. One is the source document, the other is
what you choose to put into the index from that document. Not to even
mention that you could, for instance, choose to index only the title
and throw everything else away so the size of the raw document on disk
doesn't seem useful for your case.

Best,
Erick




Re: Size of Document

Chris Bamford
Yes, I see -- I originally missed Terry’s response, which is probably the source of the confusion.

So to clarify: I already know the size of the source document. As you say, this bears little resemblance to what actually gets written when indexed. It is this latter figure I was hoping to get.

Thanks everyone.

Chris





---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


Re: Size of Document

Adrien Grand
For the record, this is made even more complex by the fact that the disk
footprint of a document depends on other documents that are indexed nearby
in the same segment, and can change over merges.
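Given that, one pragmatic way to approximate "real bytes written" is to diff the index directory's on-disk size around a commit, rather than trying to attribute bytes to a single document. A minimal stdlib-only sketch of the size measurement; the class name and the `_0.cfs` stand-in file are illustrative (no Lucene is actually run here):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

public class IndexDiskUsage {

    // Sum the sizes of all regular files directly inside a directory
    // (Lucene index directories are flat, so no recursion is needed).
    static long directorySizeBytes(Path dir) throws IOException {
        try (Stream<Path> files = Files.list(dir)) {
            return files.filter(Files::isRegularFile)
                        .mapToLong(p -> {
                            try {
                                return Files.size(p);
                            } catch (IOException e) {
                                return 0L; // file vanished between list and stat
                            }
                        })
                        .sum();
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("index");
        long before = directorySizeBytes(dir);              // e.g. before commit
        Files.write(dir.resolve("_0.cfs"), new byte[1024]); // stand-in for new segment files
        long after = directorySizeBytes(dir);               // e.g. after commit
        System.out.println(after - before);                 // prints 1024
    }
}
```

Adrien's caveat still applies: merges rewrite segments, so a delta measured this way is only meaningful between one commit and the next merge.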

On Thu, 5 Jul 2018 at 08:22, Chris Bamford <[hidden email]> wrote:

> Yes I see, I originally missed Terry’s response which is probably the
> source of the confusion.
>
> So to clarify: I already know the size of the source document. As you say,
> this bears little resemblance to what actually gets written when indexed.
> It is this latter figure I was hoping to get.
>
> Thanks everyone.
>
> Chris
>
>
>
> > On 5 Jul 2018, at 03:31, Erick Erickson <[hidden email]> wrote:
> >
> > I think we're not talking about the same thing.
> >
> > You asked "How can I calculate the total size of a Lucene Document"...
> >
> > I was responding to Terry's comment "In the document types I
> > usually index (.pdf, .docx/.doc, .eml), there exists a metadata field
> > called "stream_size" that contains the size of the document on disk. "
> >
> > Two totally different beasts. One is the source document, the other is
> > what you choose to put into the index from that document. Not to even
> > mention that you could, for instance, choose to index only the title
> > and throw everything else away so the size of the raw document on disk
> > doesn't seem useful for your case.
> >
> > Best,
> > Erick