parsing and using xml-data


parsing and using xml-data

Karsten Dello
Dear list,

I would like to process metadata from publication repositories into a
Nutch index.
The metadata comes as XML (OAI-PMH, to be more precise).

The starting URLs look like

http://oai_host/servlet?method=getRecords&set=someSet

These requests return lists,
which basically look like

<list>
<item>
        <id>32423</id>
        <content>very long description1, e.g. an abstract</content>
        <url>http://somewhere.com/somedoc1.pdf</url>
</item>

<item>
        <id>12441</id>
        <content>very long description2, e.g. an abstract</content>
        <url>http://somewhereelse.it/somedoc2.pdf</url>
</item>

</list>

My initial idea was to use the Parser extension point
and provide a plugin that works the same way the RSS parser does:
return all outlinks to the detail views
- e.g.
http://oai_host/servlet?method=getSingleRecord&id=_value_of_id-element_ -
and skip the content of the list.
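
For illustration, a minimal sketch of what such a list parser could do,
using only the standard javax.xml DOM API. The getSingleRecord URL
pattern is just the one from the example above, and how the returned
URLs would be handed back to Nutch's Parser extension point (as
outlinks) is not shown here and would depend on the Nutch version.

import java.io.ByteArrayInputStream;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/**
 * Sketch: extract the detail-view outlinks from an OAI-style list,
 * skipping the rest of the list content.
 */
public class ListOutlinkExtractor {

    public static List<String> extractDetailUrls(byte[] xml) throws Exception {
        DocumentBuilder builder =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new ByteArrayInputStream(xml));

        List<String> outlinks = new ArrayList<String>();
        NodeList items = doc.getElementsByTagName("item");
        for (int i = 0; i < items.getLength(); i++) {
            Element item = (Element) items.item(i);
            String id = item.getElementsByTagName("id").item(0).getTextContent();
            // Build the single-record URL from the <id> element only.
            outlinks.add("http://oai_host/servlet?method=getSingleRecord&id=" + id);
        }
        return outlinks;
    }
}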

Following these links would return documents containing one item each.
Is it possible to store these documents under the URL from the
<url> element instead of the "real" URL (i.e. the servlet URI used for
the request)?

Would this work out? Can you suggest a better approach?

Anyway, refetching all the single hits is pretty much a waste,
as all the information is already included in the list.
Any comments on that?
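
To make that concrete, a sketch that pulls the complete records
straight out of the list response, keyed by the <url> element, so no
detail pages would have to be refetched. Whether and how Nutch could
then index these records under those URLs instead of the servlet URI is
exactly the open question above; this only shows that the list already
carries all the data.

import java.io.ByteArrayInputStream;
import java.util.LinkedHashMap;
import java.util.Map;

import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/**
 * Sketch: read the full records out of the list response, keyed by the
 * <url> element, so nothing needs to be refetched.
 */
public class ListRecordExtractor {

    /** Maps the target document URL to its abstract/content text. */
    public static Map<String, String> extractRecords(byte[] xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new ByteArrayInputStream(xml));

        Map<String, String> records = new LinkedHashMap<String, String>();
        NodeList items = doc.getElementsByTagName("item");
        for (int i = 0; i < items.getLength(); i++) {
            Element item = (Element) items.item(i);
            String url = item.getElementsByTagName("url").item(0).getTextContent();
            String content = item.getElementsByTagName("content").item(0).getTextContent();
            records.put(url, content);
        }
        return records;
    }
}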


Help would be very much appreciated,

Best regards

Karsten



Re: parsing and using xml-data

Stefan Groschupf
Hi Karsten,

Nutch has the limitation of one URL per document (in the crawlDB or index).
The content and metadata for a document are normally whatever is
available 'behind' its URL. The only exception is anchor text: anchor
text is data from the "mother" URL that is passed along and indexed
within the "child" document.
So you could hack something together and try to use the anchor text as
a data container, but I am not sure whether that will solve your problem.
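
Roughly, that hack would amount to something like the following when
the list parser emits its outlinks. Outlink here is only a stand-in
class for illustration, not Nutch's real outlink type, and the values
are copied from the first <item> of the example list.

/**
 * Sketch of the "anchor text as data container" idea: when the parser
 * emits an outlink (to the detail page or straight to the document from
 * the <url> element), the <content> text could be passed as the anchor,
 * so the abstract gets indexed with the "child" document.
 */
public class AnchorHack {

    // Minimal stand-in for an outlink carrying anchor text.
    static class Outlink {
        final String toUrl;
        final String anchor;

        Outlink(String toUrl, String anchor) {
            this.toUrl = toUrl;
            this.anchor = anchor;
        }
    }

    public static void main(String[] args) {
        // Values as they would come from one <item> of the list above.
        String url = "http://somewhere.com/somedoc1.pdf";
        String content = "very long description1, e.g. an abstract";

        // The abstract rides along as the anchor text of the outlink.
        Outlink outlink = new Outlink(url, content);
        System.out.println(outlink.toUrl + " <- " + outlink.anchor);
    }
}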

I suggest extracting the links from your starting URLs and trying to
get the content from the detail pages.
Not sure if that will help you.
Stefan

