Indexing msword document properties


Indexing msword document properties

ahammad
I have successfully gotten Nutch to index msword documents. If you go under File>Properties, and under the "Custom" tab in MS Word, you can add some properties to the file, sort of like HTML meta tags.

I have the msword parser, the index-more and query-more plugins, and a custom meta tag indexer/filter installed. My question is: can Nutch read document properties like the ones I described? Can it go that far into the document and extract custom user-defined properties?

If so, has anybody implemented this successfully? If not, I imagine we would need to modify the index-more/query-more plugins to do it. Can someone confirm this?

Anyone know of a good place to start looking? Any help will be appreciated.

Cheers.

Re: Indexing msword document properties

ahammad
Hello,

I've been looking further into this, and it seems the only way to do it is to modify the msword parser so that it reads the custom properties. I'm attempting this, but so far without success.

The classes that I found that may be useful are org.apache.poi.hpsf.DocumentSummaryInformation and org.apache.poi.hpsf.CustomProperties. I'm not sure whether I need anything else.

I'm currently trying to modify MSExtractor.java and MSBaseParser.java in the lib-parsems plugin. Am I proceeding correctly with this, or am I just wasting my time?

Does anybody have any other suggestions? This looks like a lot of work with a very small chance of success, so any alternative methods would be welcome.

Thanks a lot.

Cheers



Re: Indexing msword document properties

Doğacan Güney-3
On Fri, Jan 30, 2009 at 9:15 PM, ahammad <[hidden email]> wrote:

>
> Hello,
>
> I've been looking further into this and it seems like the only way to do it
> is to modify the msword parser so that it reads in the custom properties
> information. I'm attempting this but so far, I wasn't successful.
>
> The classes that I found that may be useful are
> org.apache.poi.hpsf.DocumentSummaryInformation and
> org.apache.poi.hpsf.CustomProperties. Not sure if there are other things
> that I need.
>
> I'm currently trying to modify MSExtractor.java and MSBaseParser.java in the
> lib-parsems plugin. Am I proceeding correctly with this or am I just wasting
> my time?
>
> Anybody has any other suggestions? This seems like it'll be a lot of work
> with a very small chance of success. Any alternative methods would be nice.
>

No, you are doing the right thing. Alternatively, if you know of a good Java
library for extracting the information you are looking for, you can write your
own parse-ms plugin as well.

Extract any metadata you want and put it in the parse data metadata. You can then
read it during indexing and add it to your index.
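
The two steps described above (store the properties as parse metadata, then copy selected ones into the index) can be sketched roughly as follows. This is only an illustration: plain Java maps stand in for Nutch's parse metadata and for the indexed document, and the names MetadataFlowSketch, toParseMetadata and toIndexFields are hypothetical, not part of Nutch. Lower-casing the value is likewise an assumption, made so that a query such as productType:printer would match.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/**
 * Illustrative stand-in for the parse-then-index flow. None of these types
 * are Nutch classes; plain maps play the role of the parse metadata and of
 * the indexed document.
 */
final class MetadataFlowSketch {

    /** Parse step: copy every non-null custom property into the parse metadata. */
    static Map<String, String> toParseMetadata(Map<String, Object> customProperties) {
        Map<String, String> parseMeta = new LinkedHashMap<>();
        for (Map.Entry<String, Object> e : customProperties.entrySet()) {
            if (e.getValue() != null) {
                parseMeta.put(e.getKey(), e.getValue().toString());
            }
        }
        return parseMeta;
    }

    /** Index step: add only the requested metadata keys as document fields. */
    static Map<String, String> toIndexFields(Map<String, String> parseMeta,
                                             Set<String> fieldsToIndex) {
        Map<String, String> doc = new LinkedHashMap<>();
        for (String field : fieldsToIndex) {
            String value = parseMeta.get(field);
            if (value != null) {
                // Assumption: values are lower-cased so that "Printer" matches "printer".
                doc.put(field, value.toLowerCase(Locale.ROOT));
            }
        }
        return doc;
    }

    public static void main(String[] args) {
        Map<String, Object> custom = new LinkedHashMap<>();
        custom.put("productType", "Printer");
        custom.put("reviewedBy", "someone");
        Map<String, String> doc =
            toIndexFields(toParseMetadata(custom), Collections.singleton("productType"));
        System.out.println(doc); // prints {productType=printer}
    }
}
```

In a real setup the first method would live in the parser and the second in an indexing filter; only the fields you explicitly ask for end up searchable.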




--
Doğacan Güney

Re: Indexing msword document properties

adb
Nutch 0.9 already extracts the properties in MSExtractor.java and MSBaseParser
puts them into the MetaData class.

I'm not using Nutch in its entirety, only the parsing framework, but I am
indexing the document properties quite happily from MS documents.  I also wrote
a new parser for Office 2007, using POI 3.5 and that is also getting the
properties in a similar way.  Is the problem at a higher level in that Nutch is
not indexing the MetaData?

Antony





Re: Indexing msword document properties

ahammad
Seems like my previous message never went through.

The Nutch msword parser does index _some_ metadata. If you go into File>Properties and under the Summary tab (in Microsoft Word), that metadata is indexed (like author, company etc.). However, you can add custom properties (File>Properties under the Custom tab) to any Word document. That metadata is not indexed.

As an example, I have a set of files that have some information relating to product types. In those files, there is a custom property called productType, which can contain values like fax, printer, monitor etc.

What I want is to index those files so that I can search on the product type. For instance, if I enter "canon +productType:printer", I should get only the documents that have to do with Canon printers. I already have a query filter in place that can do that; it's just a matter of getting the productType custom property into the index.

Does the POI parser that you wrote have the ability to parse custom properties from Microsoft Word documents?

Thank you for your reply.

Cheers



Re: Indexing msword document properties

adb
> Seems like my previous message never went through.
>
> The Nutch msword parser does index _some_ metadata. If you go into
> File>Properties and under the Summary tab (in Microsoft Word), that metadata
> is indexed (like author, company etc.). However, you can add custom
> properties (File>Properties under the Custom tab) to any Word document. That
> metadata is not indexed.
>
> As an example, I have a set of files that have some information relating to
> product types. In those files, there is a custom property called
> productType, which can contain values like fax, printer, monitor etc.
>
> What I want to be able to do is to index those files so I can be able to
> search on the product type. For instance, if I put "canon
> +productType:printer", I'll get only the documents that have to do with
> Canon printers. I already have a query filter in place that can do that,
> it's just a matter of getting the productType custom property in the index.
>
> The POI parser that you wrote, does it have the ability to parse custom
> properties from Microsoft Word documents?

It didn't, but I just added it - it was trivial.  I'm using POI 3.5 and my
parser is doing something like:

     // Extract the text, then pick the metadata reader that matches the format.
     byte[] raw = content.getContent();
     POITextExtractor extractor =
         ExtractorFactory.createExtractor(new ByteArrayInputStream(raw));
     text = extractor.getText();
     if (extractor instanceof POIOLE2TextExtractor)
     {
         // Binary (OLE2) Office formats
         properties = getOLE2MetaData((POIOLE2TextExtractor) extractor);
     }
     else if (extractor instanceof POIXMLTextExtractor)
     {
         // Office 2007 (XML-based) formats
         properties = getXMLMetaData((POIXMLTextExtractor) extractor);
     }

I just tried getting custom properties from the OLE2 text extractor, which is
based on the MSExtractor implementation:

     private Properties getOLE2MetaData(POIOLE2TextExtractor extractor)
     {
         Properties props = new Properties();
         SummaryInformation si = extractor.getSummaryInformation();
...
         DocumentSummaryInformation dsi = extractor.getDocSummaryInformation();
         // Either object can be null, e.g. when the document defines no custom properties.
         CustomProperties cp = (dsi == null) ? null : dsi.getCustomProperties();
         if (cp != null)
         {
             for (Object key : cp.keySet())
             {
                 String name = (String) key;
                 Object value = cp.get(name);
                 if (value != null)
                 {
                     setProperty(props, name, value.toString());
                 }
             }
         }
         return props;
     }

This works nicely.  I didn't try the XML variant, but I guess that would be
pretty similar.
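
For the Office 2007 case, one dependency-free way to see which custom properties a file carries is to read it as a ZIP archive: in that format the custom properties are stored in the docProps/custom.xml part. The sketch below does this with the JDK only, so it is an alternative technique and not the getXMLMetaData method mentioned above, which would go through POI. The class name OoxmlCustomProps and its methods are made up for illustration, and sampleFile() builds a minimal archive containing just that one part rather than a complete .docx.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

/** Hypothetical helper: reads name/value pairs from docProps/custom.xml. */
final class OoxmlCustomProps {

    static Map<String, String> read(byte[] ooxmlFile) {
        Map<String, String> props = new LinkedHashMap<>();
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(ooxmlFile))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (!"docProps/custom.xml".equals(entry.getName())) {
                    continue;
                }
                // Buffer the part first so the XML parser never touches the ZIP stream.
                ByteArrayOutputStream xml = new ByteArrayOutputStream();
                byte[] buf = new byte[4096];
                for (int n = zip.read(buf); n != -1; n = zip.read(buf)) {
                    xml.write(buf, 0, n);
                }
                DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
                factory.setNamespaceAware(true);
                factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
                Document dom = factory.newDocumentBuilder()
                        .parse(new ByteArrayInputStream(xml.toByteArray()));
                NodeList nodes = dom.getElementsByTagNameNS("*", "property");
                for (int i = 0; i < nodes.getLength(); i++) {
                    Element property = (Element) nodes.item(i);
                    // The first child element (vt:lpwstr, vt:i4, ...) holds the value.
                    for (Node c = property.getFirstChild(); c != null; c = c.getNextSibling()) {
                        if (c instanceof Element) {
                            props.put(property.getAttribute("name"), c.getTextContent());
                            break;
                        }
                    }
                }
                break; // only one custom-properties part is expected
            }
        } catch (Exception e) {
            throw new IllegalStateException("Could not read docProps/custom.xml", e);
        }
        return props;
    }

    /** Builds a minimal archive with one custom property, for demonstration only. */
    static byte[] sampleFile() {
        String xml = "<Properties"
            + " xmlns=\"http://schemas.openxmlformats.org/officeDocument/2006/custom-properties\""
            + " xmlns:vt=\"http://schemas.openxmlformats.org/officeDocument/2006/docPropsVTypes\">"
            + "<property fmtid=\"{D5CDD505-2E9C-101B-9397-08002B2CF9AE}\" pid=\"2\""
            + " name=\"productType\"><vt:lpwstr>printer</vt:lpwstr></property>"
            + "</Properties>";
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(out)) {
            zip.putNextEntry(new ZipEntry("docProps/custom.xml"));
            zip.write(xml.getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println(read(sampleFile())); // prints {productType=printer}
    }
}
```

The resulting map could be handed to the same setProperty-style code as the OLE2 properties, so both formats end up in the metadata the same way.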
Antony





Restarting Nutch

Hrishikesh Agashe
Hi,

I am planning to do a huge crawl using Nutch (billions of URLs), so I need
to understand whether Nutch can handle restarts after a crash.

On a single system, if I press Ctrl+C while Nutch is running and then restart
it, will Nutch be able to detect where it got to in the last run and continue
from that point? Or will it be treated as a fresh crawl?

Also, if I have 5 nodes running Nutch and doing the crawling, and one of the
nodes fails, should that be considered a total failure of Nutch itself? Or
should I allow the other nodes to proceed? Will I lose the data gathered by
the failed node?

TIA,
--Hrishi



Re: Restarting Nutch

Sami Siren-2
[moving this to nutch-user]

Hrishikesh Agashe wrote:

> Hi,
>
> I am planning to do a huge crawl using Nutch (billions of URLs), so I need
> to understand whether Nutch can handle restarts after a crash.
>
> On a single system, if I press Ctrl+C while Nutch is running and then restart
> it, will Nutch be able to detect where it got to in the last run and continue
> from that point? Or will it be treated as a fresh crawl?

Nutch does not try to resume the action that was interrupted.

> Also, if I have 5 nodes running Nutch and doing the crawling, and one of the
> nodes fails, should that be considered a total failure of Nutch itself? Or
> should I allow the other nodes to proceed? Will I lose the data gathered by
> the failed node?

Hadoop will execute the remaining tasks on the nodes that are available.
Usually data will be stored on a shared/distributed filesystem (like
HDFS). If your setup is similar and you ensure that the filesystem can
survive single-node failures, your data should be safe.

--
 Sami Siren