Interesting use case for "numeric synonyms"

David Spencer-2

I just came across an interesting concept, "numeric synonyms"...I'm
looking at the powerpoint contribution:

However initially I'm using the code in the context of Lucene, not
Nutch, so I've changed it slightly.

I have 200 or so PPT files to test it on, and for around 20% of them it
says there's no body (i.e. no text). A spot check shows this to be wrong,
and sure enough the code gets exceptions, squelches them, has buffer
overruns, etc. [but I'm not complaining - I know it's hard to reverse
engineer MSFT formats]. The code has these definitions:
   public static final int PPT_MASTERSLIDE = 1024;
   public static final int PPT_SLIDEPERSISTANT_ATOM = 1011;
   public static final int PPT_DRAWINGGROUP_ATOM = 61448;
   public static final int PPT_TEXTCHAR_ATOM = 4000;
   public static final int PPT_TEXTBYTE_ATOM = 4008;
   public static final int PPT_USEREDIT_ATOM = 4085;

So I decided to look for other implementations of PowerPoint parsers, even
in other languages - the obvious Google searches didn't work
("powerpoint file format"), and was of no help, so I
decided to search for just the numbers above w/ Google i.e. "4000 4008

Now I've used ppthtml before, but I
had an old note that it sometimes goes into an infinite loop, so I try
not to use it for indexing. But hey, it does the same work as the
Nutch/PPT parser, yet Google didn't return it (or its source code) as a
match. How can that be? Surely it uses the same constants...

I start reading ppthtml.c and see:

            switch (type) {
                case 0x0FA0: /* Text String in unicode */
                case 0x0FA8: /* Text String in ASCII */
                case 0x0FBA: /* CString - unicode... */

And sure enough, the first two hex values there match the Java decimal
values above (0x0FA0 = 4000, 0x0FA8 = 4008) [the third one is not covered
by the Java code but doesn't seem to matter].
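
For anyone who wants to double-check the match, here is a minimal sketch in plain Java. The class name PptConstantCheck and the helper sameValue are made up for illustration; the two constants are copied from the definitions quoted above, and Integer.decode is the standard JDK method that understands the "0x" prefix.

```java
public class PptConstantCheck {
    // Decimal values copied from the Java definitions quoted above.
    static final int PPT_TEXTCHAR_ATOM = 4000;
    static final int PPT_TEXTBYTE_ATOM = 4008;

    // Hypothetical helper: true when a C-style hex literal such as "0x0FA0"
    // denotes the same value as the given decimal int.
    static boolean sameValue(String hexLiteral, int decimal) {
        return Integer.decode(hexLiteral) == decimal;
    }

    public static void main(String[] args) {
        System.out.println(sameValue("0x0FA0", PPT_TEXTCHAR_ATOM)); // true
        System.out.println(sameValue("0x0FA8", PPT_TEXTBYTE_ATOM)); // true
        System.out.println(Integer.decode("0x0FBA"));               // 4026
    }
}
```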

So...the point: is there any prior art or discussion on covering
this, so that a search for a number can find a match even if the number
is represented in another base?

In Lucene-speak, this means that either when indexing, or parsing the
query, the Analyzer expands a number like, say, 4000 to multiple tokens
at the same offset:
        4000 - decimal, not changed
        0x0*FA0 - hex, "0*" for optional leading zeros
        00*7640 - leading zero usually means octal
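
To make the expansion above concrete, here is a rough sketch of just the number-to-synonyms step, in plain Java with no Lucene dependency. The class NumericSynonyms and the method expand are hypothetical names, not part of Lucene or Nutch; in a real Analyzer this logic would presumably sit in a token filter that emits the extra spellings at the same position as the original token. It assumes the input token is a plain decimal number and emits one canonical spelling per base, so the "optional leading zeros" part would still have to be handled by normalizing tokens on the other side.

```java
import java.util.ArrayList;
import java.util.List;

public class NumericSynonyms {
    // Hypothetical helper: returns the token itself plus, for plain decimal
    // numbers, its hex ("0x" prefix) and octal (leading "0") spellings.
    // Anything that is not a plain decimal number comes back unchanged.
    static List<String> expand(String token) {
        List<String> out = new ArrayList<>();
        out.add(token); // decimal, not changed
        // Up to 9 digits with no leading zero always fits in an int.
        if (!token.matches("[1-9][0-9]{0,8}")) {
            return out;
        }
        int value = Integer.parseInt(token);
        out.add("0x" + Integer.toHexString(value).toUpperCase()); // hex
        out.add("0" + Integer.toOctalString(value));              // octal
        return out;
    }

    public static void main(String[] args) {
        System.out.println(expand("4000")); // [4000, 0xFA0, 07640]
        System.out.println(expand("ppt"));  // [ppt]
    }
}
```

Doing the expansion at index time makes the index bigger but keeps queries simple; doing it at query time is the reverse trade-off.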

Hope this list is a reasonable place for this.

A related question: is the PowerPoint format documented anywhere? For
the life of me I couldn't find out where the various constants came from.