Integrating Nutch w/Alexa

kkrugler
Hi there,

Has anybody looked into running Nutch with Alexa? I.e. using their
data store as the source for data that you'd typically be fetching?

The fact that their APIs are Perl & C based would make this
non-trivial, I imagine.

I tried searching on their documentation site for Java - kind of
funny that what you get when you click the Search button is one step
removed from a raw dump of a Lucene index.

Found one ref to Java, on a page that says the programmatic
interfaces that allow users to develop applications to process
Alexa's Web repository are written in C, Perl, or Java.

But I couldn't find any other Java refs. So maybe that page is out of
date, or a foreshadowing of things to come.

-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-470-9200

RE: Integrating Nutch w/Alexa

Howie Wang
I joined their beta program and played around with getting data.
My guess is that you will do very little with their API besides
setting up a program to dump their data to a file. It's pretty simple
to just tweak their sample app and print out the fields you want.
All the fetched pages end up in one big file so you have to figure
out a nice way to delimit things.
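
One possible way to delimit pages in a single dump file is to length-prefix each record, so page content containing newlines or any other byte sequence cannot be confused with a separator. The sketch below is only an illustration of that idea in Java. The `PageDump` class, the `Page` type, and the record layout are all assumptions made up for this example; none of it comes from Alexa's sample app or from Nutch.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical length-prefixed dump format; not part of any Alexa or Nutch API. */
public class PageDump {

    /** One fetched page: its URL plus the raw content bytes. */
    public static final class Page {
        public final String url;
        public final byte[] content;

        public Page(String url, byte[] content) {
            this.url = url;
            this.content = content;
        }
    }

    /** Appends one record: [url length][url bytes][content length][content bytes]. */
    public static void writePage(DataOutputStream out, Page page) throws IOException {
        byte[] urlBytes = page.url.getBytes(StandardCharsets.UTF_8);
        out.writeInt(urlBytes.length);
        out.write(urlBytes);
        out.writeInt(page.content.length);
        out.write(page.content);
    }

    /**
     * Reads the next record. Returns null when the stream ends before another
     * record header can be read (a truncated header is treated the same way).
     */
    public static Page readPage(DataInputStream in) throws IOException {
        int urlLength;
        try {
            urlLength = in.readInt();
        } catch (EOFException e) {
            return null; // no more records
        }
        byte[] urlBytes = new byte[urlLength];
        in.readFully(urlBytes);
        byte[] content = new byte[in.readInt()];
        in.readFully(content);
        return new Page(new String(urlBytes, StandardCharsets.UTF_8), content);
    }

    /** Reads every record from the stream into a list. */
    public static List<Page> readAll(InputStream raw) throws IOException {
        List<Page> pages = new ArrayList<>();
        DataInputStream in = new DataInputStream(raw);
        Page page;
        while ((page = readPage(in)) != null) {
            pages.add(page);
        }
        return pages;
    }

    public static void main(String[] args) throws IOException {
        // Round trip through an in-memory buffer standing in for the dump file.
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buffer);
        writePage(out, new Page("http://example.com/a",
                "<html>first\npage</html>".getBytes(StandardCharsets.UTF_8)));
        writePage(out, new Page("http://example.com/b",
                "<html>second</html>".getBytes(StandardCharsets.UTF_8)));
        out.flush();

        for (Page p : readAll(new ByteArrayInputStream(buffer.toByteArray()))) {
            System.out.println(p.url + " (" + p.content.length + " bytes)");
        }
    }
}
```

For a real dump the byte-array streams would be replaced with buffered file streams. A text delimiter line would be easier to eyeball, but it would need escaping whenever a page happens to contain the delimiter, which is why this sketch uses lengths instead.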

After you get your fetch dump, download it to your own box and
do whatever you want with it. I guess you'll have to write a utility
to get all the pages into Nutch and index them. I haven't gotten
around to doing it. I don't think it's worth it to do your processing
on the Alexa servers. You'd have to pay for extra processing time
and still pay for the download of the Nutch index.

The idea behind the Alexa service is quite nice. Unfortunately
they seem not to have a lot of the pages that I'm looking for.
Still, Alexa is most likely the best way of jump-starting your index.
For $1000 (the cost of a good crawling PC), you can download
nearly a TB of data.

Howie
