The more I look at CrawlDbReducer the less I like the method it uses to
select the most recent records.
This selection is primarily made in the while() loop in
CrawlDbReducer:45. My main objection is that selecting the "highest"
value (meaning "most recent") relies on the fact that values of status
codes in CrawlDatum are ordered according to their meaning, and they are
treated as a sort of state machine. However, adding new states is very
difficult, if they should have values lower than STATUS_FETCH_GONE, as
it leads to breaking backwards-compatibility with older segment data.
Adding status codes with higher values may also break things here,
because a CrawlDatum with the highest code would not necessarily be the
most recent one.
I first encountered this problem when adding the signature framework.
Fortunately there was one unused value (0) at that time, so I could add
CrawlDatum.STATUS_SIGNATURE without breaking the assumptions in
However, now things become more difficult:
* we need another status code for pages newly discovered as a
result of redirection (see the thread on "Meta-refresh"). If we add this
status as e.g. STATUS_FETCH_REDIRECT = 8, then the logic in
CrawlDbReducer will break.
* we need something to mark pages as "being on a fetchlist, to be
updated soon" (this is to support multiple parallel
generate/fetch/update cycles). A new status code would do fine for this
purpose (although we need an expiry timer for that too). Arguably, we
could use the same trick that we used in 0.7 (moving next fetch time 1
week into the future), but I'm not sure yet how it would play with the
adaptive fetch patches, which manipulate this value too...
I could use a hack in the meantime: status values are for now all below
128; we could use the upper nibble for these additional flags, and mask
them out with 0x0f.
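A minimal sketch of the nibble hack described above, assuming the flag lives in bit 4 and the plain status stays in the low four bits. The class, constant, and method names here are hypothetical and not part of CrawlDatum. Note that this only works as long as the plain status codes stay below 16.

```java
// Hypothetical helper; none of these names exist in Nutch.
final class StatusFlags {
    // Low nibble carries the existing status code.
    static final int STATUS_MASK = 0x0f;
    // Assumed flag bit in the upper nibble: "on a fetchlist, to be updated soon".
    static final int FLAG_ON_FETCHLIST = 0x10;

    // Sets a flag bit without touching the status in the low nibble.
    static byte withFlag(byte raw, int flag) {
        return (byte) (raw | flag);
    }

    // Strips all flag bits and returns the plain status code.
    static byte status(byte raw) {
        return (byte) (raw & STATUS_MASK);
    }

    // Tests whether a given flag bit is set.
    static boolean hasFlag(byte raw, int flag) {
        return (raw & flag) != 0;
    }
}
```

Existing comparisons would then have to run on status(raw) rather than on the raw byte, otherwise a flagged value would compare as higher than every unflagged one.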
Andrzej Bialecki <><
Andrzej Bialecki wrote:
> This selection is primarily made in the while() loop in
> CrawlDbReducer:45. My main objection is that selecting the "highest"
> value (meaning "most recent") relies on the fact that values of status
> codes in CrawlDatum are ordered according to their meaning, and they are
> treated as a sort of state machine.
Yes, that was the design: status codes are also priorities.
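To illustrate what "status codes are also priorities" implies, here is a simplified sketch of a highest-value-wins selection. It is not the actual CrawlDbReducer loop; the class name, the method, and the numeric values (including STATUS_FETCH_REDIRECT = 8 from the example in the thread) are assumptions for illustration.

```java
// Simplified illustration only; not the real CrawlDbReducer code.
final class HighestStatusWins {
    // Numeric values assumed for this sketch.
    static final int STATUS_FETCH_GONE = 7;
    static final int STATUS_FETCH_REDIRECT = 8; // hypothetical new code

    // Picks the numerically highest status, treating the code as its priority.
    static int select(int... statuses) {
        int best = statuses[0];
        for (int s : statuses) {
            if (s > best) {
                best = s;
            }
        }
        return best;
    }
}
```

With this rule, any new code appended at the top of the range always wins over every existing status, whether or not that matches its intended meaning.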
> However, adding new states is very
> difficult, if they should have values lower than STATUS_FETCH_GONE, as
> it leads to breaking backwards-compatibility with older segment data.
We can use CrawlDatum.VERSION to insert new status codes
back-compatibly. Perhaps we should change the codes from [0, 1, 2, ...]
to [0, 10, 20, 30, ...] so that we can more easily introduce new
values? To update status codes from older versions we simply multiply
by 10.
Would something like that work?
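A rough sketch of the version-based upgrade, assuming the reader knows the version a record was written with. The version numbers, class name, and method name are hypothetical; only the multiply-by-10 rule comes from the proposal above.

```java
// Hypothetical upgrade helper; version numbers are assumed for illustration.
final class StatusUpgrade {
    static final int OLD_VERSION = 1; // records with codes 0, 1, 2, ...
    static final int CUR_VERSION = 2; // records with codes 0, 10, 20, ...

    // Converts a status read from a record of the given version
    // into the current, spaced-out numbering.
    static int upgrade(int version, int status) {
        if (version < CUR_VERSION) {
            return status * 10;
        }
        return status;
    }
}
```

The gaps leave room for nine new codes between any two existing ones, and the relative order of the old codes is preserved.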
Or we could have a separate table mapping status codes to priority.
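The table alternative might look like the following sketch, where the status code is only an index and the ordering comes from a separate array. All names and priority values are hypothetical; in particular, the rank given to code 8 is an arbitrary example.

```java
// Hypothetical priority table; codes and ranks are illustrative only.
final class StatusPriority {
    // PRIORITY[code] is the rank used for selection.
    // Code 8 is placed between codes 4 and 5 purely as an example.
    private static final int[] PRIORITY = {0, 10, 20, 30, 40, 50, 60, 70, 45};

    static int priority(int status) {
        return PRIORITY[status];
    }

    // Picks the status with the highest rank, not the highest numeric code.
    static int select(int... statuses) {
        int best = statuses[0];
        for (int s : statuses) {
            if (priority(s) > priority(best)) {
                best = s;
            }
        }
        return best;
    }
}
```

This keeps the stored codes stable, so older segment data stays readable, while new codes can be ranked anywhere by editing the table.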