Removing URLs from index


Removing URLs from index

Jeroen van Vianen
Hi,

I happen to have accumulated a lot of URLs in my index with the
following layout:

http://www.company.com/directory1;if(T.getElementsByClassName(
http://www.company.com/directory2;this.bottomContainer.appendChild(u);break;case

There seem to be errors in the discovery of links from one page to the
next. I have now excluded URLs with a ';' in regex-urlfilter.txt.
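For reference, the exclusion is a deny rule in regex-urlfilter.txt; lines starting with '-' reject matching URLs. The exact pattern below is just a sketch:

```
# reject any URL containing a semicolon
-[;]
```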

My question now is, how do I remove these documents from the index?

Regards,


Jeroen

Re: Removing URLs from index

Markus Jelsma
Hi,

I assume it's about your Solr index again (for which you should mail the
Solr mailing list). It offers deleteById and deleteByQuery methods, but in
your case it's going to be rather hard: with the stock schema your URL field
is analyzed by a tokenizer that strips characters such as your semicolon.
Perhaps you can find a common trait among your bogus URLs that can be
queried. If not, you must do it manually.
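For the record, a delete-by-query is just an XML message posted to Solr's /update handler, followed by a commit. The field name `url` and the query below are assumptions; whether a useful query exists at all depends on your analysis chain, as noted above:

```xml
<!-- POST to http://localhost:8983/solr/update, then send <commit/> -->
<delete>
  <query>url:directory1</query>
</delete>
```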

But if you reindex from Nutch, the already fetched and parsed pages will
reappear in your Solr index. Removing data from Nutch is really hard;
because of your URL filter the generate command will no longer add those
URLs to the fetch queue, but the pages are still in the segments.

Cheers,

On Tuesday 17 August 2010 13:04:21 Jeroen van Vianen wrote:

> Hi,
>
> I happen to have accumulated a lot of URLs in my index with the
> following layout:
>
> http://www.company.com/directory1;if(T.getElementsByClassName(
> http://www.company.com/directory2;this.bottomContainer.appendChild(u);break;case
>
> There seem to be errors in the discovery of links from one page to the
> next. I have now excluded URLs with a ';' in regex-urlfilter.txt.
>
> My question now is, how do I remove these documents from the index?
>
> Regards,
>
>
> Jeroen
>

Markus Jelsma - Technisch Architect - Buyways BV
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


Re: Removing URLs from index

Alex McLintock
In reply to this post by Jeroen van Vianen
On 17 August 2010 12:04, Jeroen van Vianen <[hidden email]> wrote:
> Hi,
>
> I happen to have accumulated a lot of URLs in my index with the following
> layout:
>
> http://www.company.com/directory1;if(T.getElementsByClassName(
> http://www.company.com/directory2;this.bottomContainer.appendChild(u);break;case

Hmmm,

This may be thinking out loud rather than helpful:

I thought ';' was supposed to introduce a session id. I wonder whether we
can or should ignore everything after the ';' character.
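If stripping everything after the ';' is acceptable, Nutch's urlnormalizer-regex plugin could do it. A hypothetical rule for regex-normalize.xml might look like the sketch below; note it would also strip legitimate matrix parameters and real session ids, so treat it as an illustration only:

```xml
<regex>
  <!-- drop the ';' and everything after it -->
  <pattern>;.*$</pattern>
  <substitution></substitution>
</regex>
```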

I've recently seen cases where something that looked like a URL appeared
in some JavaScript and Nutch identified it as something to crawl. I don't
know whether there is an easy fix.


> There seem to be errors in the discovery of links from one page to the next.
> I have now excluded URLs with a ';' in regex-urlfilter.txt.
>
> My question now is, how do I remove these documents from the index?


Not sure. I suppose you could add a plugin of your own that runs when you
build the index - but I guess that would be too much trouble for you.

May I ask why you want them removed from the index? Is it because you
don't want users seeing them?

Alex
> Regards,
>
>
> Jeroen
>

Re: Removing URLs from index

Jeroen van Vianen
On 17-8-2010 13:35, Alex McLintock wrote:

>> I happen to have accumulated a lot of URLs in my index with the following
>> layout:
>>
>> http://www.company.com/directory1;if(T.getElementsByClassName(
>> http://www.company.com/directory2;this.bottomContainer.appendChild(u);break;case
>
> Hmmm,
>
> This may be thinking out loud rather than helpful:
>
> I thought ';' was supposed to introduce a session id. I wonder whether we
> can or should ignore everything after the ';' character.

Maybe we should. I'm unsure why these JS fragments were added to the URLs
to crawl in the first place. The problem is that the web server happily
serves URLs with the above structure and generates proper content, probably
because the JS fragment is an invalid session id and the web server
automatically creates a new session.

> I've recently seen cases where something that looked like a URL appeared
> in some JavaScript and Nutch identified it as something to crawl. I don't
> know whether there is an easy fix.
>
>
>> There seem to be errors in the discovery of links from one page to the next.
>> I have now excluded URLs with a ';' in regex-urlfilter.txt.
>>
>> My question now is, how do I remove these documents from the index?
>
>
> Not sure. I suppose you could add a plugin of your own that runs when you
> build the index - but I guess that would be too much trouble for you.
>
> May I ask why you want them removed from the index? Is it because you
> don't want users seeing them?

Yes. I get lots of near-duplicate results because these URLs occur many
times for the same original URL.

Thanks and best regards,


Jeroen

Re: Removing URLs from index

Jeroen van Vianen
In reply to this post by Markus Jelsma
On 17-8-2010 13:35, Markus Jelsma wrote:
> I assume it's about your Solr index again (for which you should mail the
> Solr mailing list). It offers deleteById and deleteByQuery methods, but in
> your case it's going to be rather hard: with the stock schema your URL field
> is analyzed by a tokenizer that strips characters such as your semicolon.
> Perhaps you can find a common trait among your bogus URLs that can be
> queried. If not, you must do it manually.

That's too bad, as I'm unsure which URLs to look for. I think I'll just
remove the entire domain name and crawl it again.

> But if you reindex from Nutch, the already fetched and parsed pages will
> reappear in your Solr index. Removing data from Nutch is really hard;
> because of your URL filter the generate command will no longer add those
> URLs to the fetch queue, but the pages are still in the segments.

Clear.

Thanks,


Jeroen

Re: Removing URLs from index

Markus Jelsma
In reply to this post by Jeroen van Vianen

On Tuesday 17 August 2010 13:47:32 Jeroen van Vianen wrote:
>
> Yes. I get lots of near-duplicate results because these URLs occur many
> times for the same original URL.

You can use deduplication [1]. Depending on configuration, it generates
signatures for exact or near-duplicate content and can then optionally
overwrite (delete) the duplicates.

[1]: http://wiki.apache.org/solr/Deduplication
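For reference, the wiki page above configures a SignatureUpdateProcessorFactory in solrconfig.xml. A minimal sketch (the signature field, source fields, and signature class follow the wiki's example and should be adjusted to your schema):

```xml
<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">signature</str>
    <!-- overwriteDupes=true deletes documents that share a signature -->
    <bool name="overwriteDupes">true</bool>
    <str name="fields">content</str>
    <str name="signatureClass">solr.processor.TextProfileSignature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
```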

>
> Thanks and best regards,
>
>
> Jeroen
>

Markus Jelsma - Technisch Architect - Buyways BV
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350