Extract info from documents and query external sites


Extract info from documents and query external sites

HellSpawn
Hi all, I'm new :)

I have to extract some information from an address book on my site (for example, names and surnames) and then use it to build queries on sites like scholar.google.com, indexing the result pages with my crawler. Can I do it? How?

Thank you

Rosario Salatiello

Re: Extract info from documents and query external sites

Stefan Neufeind

Not "out of the box". You'd have to figure out how to build query strings (I assume they use GET parameters) from your address book, and you could then "index" those URLs.
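For instance, a minimal sketch in Java (the scholar.google.com address and its "q" GET parameter are assumptions taken from its search form, so check them before relying on them):

import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class ScholarUrlBuilder {

    // Build a Google Scholar query URL for one author.
    // The "q" parameter name is an assumption; verify it
    // against the actual search form first.
    public static String buildQueryUrl(String name, String surname)
            throws UnsupportedEncodingException {
        String query = URLEncoder.encode(name + " " + surname, "UTF-8");
        return "http://scholar.google.com/scholar?q=" + query;
    }

    public static void main(String[] args) throws Exception {
        // Prints http://scholar.google.com/scholar?q=Ada+Lovelace
        System.out.println(buildQueryUrl("Ada", "Lovelace"));
    }
}

You could run something like that over every entry in the address book and feed the resulting URLs to the crawler as a seed list.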

For me the question remains why you'd want to do that - but you could :-)

  Stefan

Re: Extract info from documents and query external sites

HellSpawn
I'm working on a search engine for my university and they want me to do that to create a repository of scientific articles on the web :D

I read something about XPath for extracting exact parts from a document; once that's done, building the query is very easy, but my doubts are about how to fit all of this into the Nutch crawler...

Thank you

Re: Extract info from documents and query external sites

Stefan Groschupf
Think about using the Google API.

However, one way to go could be:

+ fetch your pages
+ do not parse the pages
+ write a map-reduce job that extracts your data (a sketch follows right after this list):
++ build an XHTML DOM from the HTML, e.g. using NekoHTML
++ use XPath queries to extract your data
++ also check out GATE as a named-entity extraction tool, to extract names based on patterns and heuristics
++ write the names to a file
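A rough sketch of the Neko + XPath part (not a full map-reduce job; the XPath expression and the address-book markup it matches are made up for illustration):

import java.io.FileInputStream;

import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;

import org.cyberneko.html.parsers.DOMParser;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class NameExtractor {

    public static void main(String[] args) throws Exception {
        // NekoHTML turns even messy real-world HTML into a DOM.
        DOMParser parser = new DOMParser();
        parser.parse(new InputSource(new FileInputStream(args[0])));
        Document doc = parser.getDocument();

        // Note: NekoHTML upper-cases HTML element names by default,
        // so the XPath must say //TD, not //td. The expression below
        // is illustrative; adapt it to your actual address-book markup.
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList names = (NodeList) xpath.evaluate(
                "//TABLE[@id='addressbook']//TD[@class='name']",
                doc, XPathConstants.NODESET);

        for (int i = 0; i < names.getLength(); i++) {
            System.out.println(names.item(i).getTextContent());
        }
    }
}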

+ build your query URLs
+ inject the query URLs into an empty crawl db
+ create a segment, fetch it, and update a second, empty crawl database from that segment
+ remove the first segment and db
+ create a segment from your second db and fetch it.
Your second segment will then contain only the paper pages (a command-line sketch of this cycle follows).
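Roughly, with 0.8-style Nutch commands (the directory names are illustrative, and the exact syntax depends on your Nutch version):

# inject the generated query URLs into an empty crawl db
bin/nutch inject crawldb1 query_urls/

# generate a segment from it and fetch the result pages
bin/nutch generate crawldb1 segments1
bin/nutch fetch segments1/<segment>

# update a second, empty crawl db from that segment; it now
# knows only the links found on the result pages
bin/nutch updatedb crawldb2 segments1/<segment>

# drop crawldb1 and segments1, then generate and fetch from the
# second db: this new segment holds only the paper pages
bin/nutch generate crawldb2 segments2
bin/nutch fetch segments2/<segment>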

HTH
Stefan



