querying crawldb

Michael Coffey
Hello Nutchians,
I need to be able to query a (Nutch 1.x) crawldb for read-only search/sort/summarize purposes, based on combinations of status, fetch_time, score, and things like that. What is a good tool or process for doing such things?
Up until now, I've been running readdb -dump and then processing the output with Python code that I wrote. But this is slow and clunky, and my code probably has bugs. Would Hive be a good tool for this?

RE: querying crawldb

Markus Jelsma-2
Use -expr with readdb -dump to filter records by a JEXL expression evaluated against each CrawlDatum, or use -regex. See CrawlDatum.java for the fields you can query.
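
For illustration, here is a minimal Python sketch of how a filtered dump could be driven from a script, so that the filtering happens inside Nutch rather than in post-processing. The helper name build_readdb_dump_cmd, the paths, and the bin/nutch location are all hypothetical. The JEXL variable names status and score are assumptions about what CrawlDatum exposes, so check CrawlDatum.java for your Nutch version before relying on them.

```python
import shlex


def build_readdb_dump_cmd(crawldb, out_dir, expr=None, regex=None,
                          nutch_bin="bin/nutch"):
    """Return the argv list for a filtered `readdb -dump` run.

    Hypothetical helper: it only assembles the command line and
    does not invoke Nutch itself.
    """
    cmd = [nutch_bin, "readdb", crawldb, "-dump", out_dir]
    if expr is not None:
        # JEXL expression, passed through to Nutch unchanged
        cmd += ["-expr", expr]
    if regex is not None:
        # regular expression filter, passed through unchanged
        cmd += ["-regex", regex]
    return cmd


# Assumed JEXL variable names (status, score); verify in CrawlDatum.java.
expr = 'status == "db_fetched" && score > 1.0'

# Placeholder paths, not taken from the thread.
cmd = build_readdb_dump_cmd("crawl/crawldb", "crawldb_dump", expr=expr)

# Print a shell-quoted form that can be pasted into a terminal.
print(shlex.join(cmd))

# To actually run it (requires a Nutch installation):
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Passing the command as a list avoids shell-quoting problems with the && and quote characters in the expression; shlex.join (Python 3.8+) is used only to show an equivalent command line.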

 
 
-----Original message-----
> From:Michael Coffey <[hidden email]>
> Sent: Wednesday 13th September 2017 3:45
> To: User <[hidden email]>
> Subject: querying crawldb
>
> Hello Nutchians,
> I need to be able to query a (nutch 1.x) crawldb for read-only search/sort/summarize purposes, based on combinations of status, fetch_time, score, and things like that. What is a good tool or process for doing such things?
> Up until now, I've been doing readdb-dump and then processing the output with python code that I wrote. But this is slow and clunky, and my code probably has bugs. I wonder, would Hive be a good tool for this?
>