The following discussion is an archived debate. Please do not modify it. Subsequent comments should be made in a new section. The result of the discussion was Approved.

Operator: Staecker

Automatic or Manually Assisted: Eventually automatic, manual during testing

Programming Language(s): python (pywikipedia)

Function Summary: nominate duplicate files for speedy deletion

Edit period(s) (e.g. Continuous, daily, one time run): A run every 15 minutes or so, with only a few edits per run.

Edit rate requested: I estimate no more than 100 edits per day: a few edits for each dupe, and dupes seem to pop up a few times per hour (my own informal estimate).

Function Details: The bot will use Special:Newimages to find duplicate files. It seems that most dupes are uploaded within the same 20 minutes or so by inexperienced users who forget the name of their first upload, or want to change the name for other reasons. So the bot will just search the new images since its last run. The bot will first compare the file sizes (which are given in the image gallery, so require no download), and if two files with identical file sizes are found, the files are downloaded and compared directly.
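A rough sketch of the size-screening step described above, in plain Python; new_uploads() is a hypothetical stand-in for the Special:Newimages scraping the bot actually performs, and the direct content comparison is illustrated further down in the discussion.

```python
from collections import defaultdict

def new_uploads():
    """Hypothetical helper: yield (filename, size_in_bytes, url) for each
    file listed on Special:Newimages since the bot's last run."""
    raise NotImplementedError

def size_matched_candidates():
    """Group newly uploaded files by exact byte size.

    The sizes come straight from the gallery listing, so this step costs
    no downloads; only groups with two or more members go on to the
    direct content comparison."""
    by_size = defaultdict(list)
    for name, size, url in new_uploads():
        by_size[size].append((name, url))
    return [group for group in by_size.values() if len(group) > 1]
```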

When a duplicate is found, the bot will choose an orphaned copy and nominate it for speedy deletion, requiring a few edits. These will be the only edits made by the bot.

The bot does not detect scaled copies of files, or the "same" image saved in two different file formats. Also, no effort is made to root out old duplicate files that have existed on WP for a long time; only new files are searched.

Since we only download thumbnails of size-duplicates, and will make very few edits, the server load should be negligible.

This is my first bot, so I'd appreciate any constructive feedback- Staecker

Discussion

Good idea. One question, though: how will it decide which of the two should be deleted if neither is orphaned? Will it choose the first one, or the one that is most linked, or will it only mark orphaned ones?

Another idea: MediaWiki uses ImageMagick to resize the images. I think if you reduce picture 1 to 60px and picture 2 to 60px, they will still match if they have the same source file. So instead of downloading the whole image, you can download thumbnails. I will do some testing on that for ya and tell you if it works. HighInBC (Need help? Ask me) 18:35, 5 February 2007 (UTC)[reply]

It works: both the thumbnails here at User:HighInBC/sandbox have the same MD5 checksum (eda300872a5b61eaf64574ee9fff373d), and are only 2 kB each. That should save some bandwidth. HighInBC (Need help? Ask me) 18:41, 5 February 2007 (UTC)[reply]
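The check described here amounts to hashing the two server-rendered thumbnails and comparing the digests. A minimal illustration, with placeholder thumbnail URLs (in practice these would be whatever 60px renderings MediaWiki has already generated for the two suspected duplicates):

```python
import hashlib
import urllib.request

def md5_of(url):
    """Download a thumbnail (only a few kB) and return its MD5 hex digest."""
    with urllib.request.urlopen(url) as f:
        return hashlib.md5(f.read()).hexdigest()

# Placeholder URLs for the 60px renderings of the two suspected duplicates.
thumb_a = "https://upload.example.org/thumb/60px-First_upload.jpg"
thumb_b = "https://upload.example.org/thumb/60px-Second_upload.jpg"

if md5_of(thumb_a) == md5_of(thumb_b):
    print("identical thumbnails -- almost certainly the same source file")
```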

Good point about the thumbnails. That will save some bandwidth. As for the orphaning issues, I haven't given it much sophistication as of now. If neither is orphaned, it'll keep them both. But that should be a pretty easy upgrade to make in the future. Staecker 18:51, 5 February 2007 (UTC)[reply]
I've just implemented a thumbnail downloader instead of the full image. Thanks to HighInBC for saving some WP server load, and some of my own too. And more importantly, a much more elegant solution. Staecker 02:37, 6 February 2007 (UTC)[reply]

It makes a lot of sense to avoid ones that aren't orphaned, because of all the things to be taken into account when orphaning a duplicate: which has the better filename, which one has the better description page, etc. Once it is up and running, you can work on the bells and whistles. I support this bot. HighInBC (Need help? Ask me) 18:54, 5 February 2007 (UTC)[reply]

This is a great idea. Can you look into converting the least-linked image to the other? You might also want to look into the current images that are already on the servers, if you have the time/bandwidth to do so. Betacommand (talkcontribsBot) 18:57, 5 February 2007 (UTC)[reply]
I have the server time/bandwidth to search old images, but I'm not sure that I have the coding time. An "old files" search would require a new way of retrieving uploads (I get them now from Special:Newimages), and I don't have much spare time to code all that. But I agree it'd be a great feature, and maybe functionality that I (or somebody else?) will add in the future. Staecker 19:23, 5 February 2007 (UTC)[reply]
If you wanted to do old images, you could download the image description database dump and work offline from that. This bot sounds like such a great idea; I can help you code it, but I work in Perl and find Python to be rather awkward. Perhaps once it is up and running I can write a script that creates a dataset (of old files with identical file sizes) for your bot to use. HighInBC (Need help? Ask me) 19:28, 5 February 2007 (UTC)[reply]
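A sketch of the kind of offline pre-processing suggested here, written in Python to match the bot. The dataset format is an assumption: a tab-separated listing of name and byte size extracted beforehand from the dump, not the dump's actual layout.

```python
from collections import defaultdict

def size_groups_from_dump(path="image_table.tsv"):
    """Group existing files by byte size from an offline metadata listing.

    Assumes one "name<TAB>size" row per file, pre-extracted from the
    database dump; only groups with two or more members are worth
    handing to the bot for a closer look."""
    by_size = defaultdict(list)
    with open(path, encoding="utf-8") as f:
        for line in f:
            name, size = line.rstrip("\n").split("\t")
            by_size[int(size)].append(name)
    return {size: names for size, names in by_size.items() if len(names) > 1}
```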

Is there a certainty that the one with more links is the best one? What if the one with fewer links has a more accurate set of copyright information? But then again, each speedy delete should be looked at by an admin. HighInBC (Need help? Ask me) 19:18, 5 February 2007 (UTC)[reply]

Can you just scrape the Image page for the resolution, and look for the bigger one out of the two? -- Tawker 17:48, 9 February 2007 (UTC)[reply]

Umm, the program detects identical images only; they will always have the same resolution, or the bot won't detect them as being the same. HighInBC (Need help? Ask me) 17:52, 9 February 2007 (UTC)[reply]

(Edit conflict) Two images of different resolutions won't be detected as duplicates by the bot, since they'll have different file sizes. The bot wouldn't even examine them. They wouldn't even necessarily have the same thumbnail, since different scaling algorithms will produce different outputs (especially in lossy formats like JPG). Staecker 18:11, 9 February 2007 (UTC)[reply]
There are routines in the metapixel source that can detect how similar an image is to another; it only gives a probability, though, so any results from that would need a human eye. I played around with it before to find doubles in my own collection, and it tends to be right about two-thirds of the time, so I don't think that is enough for this purpose. HighInBC (Need help? Ask me) 20:23, 9 February 2007 (UTC)[reply]

For a first version, why not just find duplicates and post the information somewhere for humans to clean up. Once that works, then extend the bot to deal with the easy and common cases. Leave the hard or uncommon cases till last, if at all. Regards, Ben Aveling 01:47, 11 February 2007 (UTC)[reply]

I agree, if I'm understanding well. I think a good easy system for v1.0 would be to speedy nominate any duplicates which have at least one orphan, deleting the orphaned version. In the case where neither is orphaned, I'll post somewhere links to the two of them for a human to sort out. Is there a good page in the WP namespace for such posts to go? Or should I just maintain a user subpage?
I hope you're not suggesting that all duplicates be human-sorted, even cases with an orphan. The simple cases are not really difficult at all, and it would be a lot of tedious work for a human to look through a giant list and speedy nom everything. But of course we could go that way if there's consensus that it's necessary. Staecker 19:58, 11 February 2007 (UTC)[reply]
I agree that clearly identical images should be handled entirely by the bot, but I think the bot would benefit from considering less obvious cases and tagging them for a human to compare to a duplicate, including cases of images with different resolutions. Also, the bot maker seems to be of the mindset that searching existing images would be too difficult to implement, but I'm curious whether you'd find it easier to implement a memory of images the bot has already seen from the newimages page. I think it would be easier to code than searching all images, and since you'd only be storing a checksum it wouldn't take up that much hard drive space. Vicarious 08:36, 22 February 2007 (UTC)[reply]
To say that detecting duplicates of different resolutions is "less obvious" is an understatement. I know of no way to do this, simple or not (other than the cutting-edge probabilistic techniques cited above by HighInBC). Two images of different resolutions will not necessarily have the same thumbnail in the gallery, and there's no simple way to recognize that they represent the same image (it's simple for a human, but not for software).
As for storing previously seen images, this is possible in theory, but would result in heavier server load (though probably not too bad) and much longer bot run times. The bot now will only download thumbnails of likely duplicates. Storing a full backlog would require the bot to download all thumbnails from the new images gallery (as you say, they wouldn't all have to be stored, just hashes). Then for each run of the bot, the software must compare each new image against the full backlog. This search will quickly become very intensive, and isn't really something that I want to have running on my server (which I use for lots of other stuff).
These are all great ideas, but I think one of the strengths of the bot right now is that it's simple, and gets the job done fairly well. I've been running it (without making edits) for a week now, and have found over 500 duplicates, very few of which have been tagged by editors. In case you're interested, I'll post my log at User:Staeckerbot/log. Staecker 13:17, 22 February 2007 (UTC)[reply]
I agree that this bot is worthwhile in its simplest incarnation and am in favor of it; however, I feel compelled to provide some solutions to the problems you've mentioned, even if they're never implemented. First, for comparing non-identical images, I may be naive but I don't think it's as revolutionary as you do: a hash function that grabs the upper bits instead of the lower ones would ignore small differences but give good results. It would certainly have some false positives, but that's why I'm suggesting we tag them for a human to check.
As for storing images, the storage space required is relatively small: assuming 100,000 images (keep in mind we're talking about only storing new ones, not all images) and a 16-byte hash value (the same length as an MD5 checksum), we're still only at a few megabytes of storage space. From what I can tell this bot already has to download every thumbnail, so there's no more downloading than before, and as for the searching, a hash table would make it an O(1) operation, so each new image would take a trivial amount of computation to compare and store. Vicarious 14:18, 22 February 2007 (UTC)[reply]
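The memory described here could be as simple as a persisted set of thumbnail digests, checked and extended on each run. A minimal sketch under those assumptions (the file name and on-disk format are illustrative, not anything the bot actually uses):

```python
import hashlib

SEEN_FILE = "seen_hashes.txt"  # illustrative path: one hex digest per line

def load_seen():
    """Load the digests recorded on previous runs (empty set on first run)."""
    try:
        with open(SEEN_FILE) as f:
            return set(line.strip() for line in f)
    except FileNotFoundError:
        return set()

def check_and_record(thumb_bytes, seen):
    """Return True if this thumbnail matches one from an earlier run.

    Set membership is an O(1) average-case lookup, so the per-image cost
    stays flat as the backlog grows; 100,000 16-byte MD5 digests is only
    a few megabytes on disk."""
    digest = hashlib.md5(thumb_bytes).hexdigest()
    if digest in seen:
        return True
    seen.add(digest)
    with open(SEEN_FILE, "a") as f:
        f.write(digest + "\n")
    return False
```
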
You're right, it wouldn't be so bad. Right now the bot only downloads thumbnails for images with identical file sizes, so it would download more often. But searching wouldn't be too bad- as you say. As for different resolutions, maybe you know more about it than I do. Thanks for the suggestions- I'll file them under "eventually". Staecker 21:59, 22 February 2007 (UTC)[reply]

As mentioned above, inexperienced users may upload a file twice. The consequence of this is that the licensing information may be on the "wrong" one, and get deleted by the bot, leaving the other to be dragged through process and deleted. Perhaps it would be possible to copy license information, where present, from the image to be deleted to the talk page of the one which will remain. Does this make sense, and seem worthwhile? Martinp23 23:08, 7 March 2007 (UTC)[reply]

It does seem worthwhile, and is a feature I could add in without too much work. Having the bot delete the older of the two versions will accomplish this task in most cases automatically, since the re-upload generally has more useful information (better naming, better licensing info). But it's something I'll look into. Staecker 00:09, 8 March 2007 (UTC)[reply]
Thanks :) Martinp23 18:39, 10 March 2007 (UTC)[reply]

What is the status of this bot? Are you ready for a trial? —METS501 (talk) 15:52, 17 March 2007 (UTC)[reply]

I was ready when I put in the request for approval. And by the way, speaking of approval... Staecker 16:10, 17 March 2007 (UTC)[reply]
OK, I'm very sorry for the delay. Approved for trial. Please provide a link to the relevant contributions and/or diffs when the trial is complete. Make 100 or so edits and report back with diffs. —METS501 (talk) 16:11, 17 March 2007 (UTC)[reply]

Trial Period results

The bot has made about 100 nominations this weekend (!), and has logged them to User:Staeckerbot/Trial log. (I'm logging them all since they disappear from Special:Contributions once the image gets deleted.) I'd say it's been pretty successful so far. A few times the admin who actually does the deletions has decided to choose "the other one", rather than delete the version that the bot has nominated. Apparently this was done in cases where the description or copyright information was different across the two versions.

In the case where image page information differs between the dupes, would it be helpful if the bot just dumped all of the text from one page to the other? I am imagining a note like "This image was uploaded twice with two different descriptions. The other description was:" and then inserting the other text. Perhaps these images could be added to a category like "Images with conflicting copyright tags" when appropriate. The big down-side of this, as far as I see it, is that the image page will become cluttered up with lots of potentially useless information which will be fairly confusing to the average user. But proper copyright tagging is important, and I think it would be good to flag images whose uploaders aren't clear exactly on which tag is appropriate.

A good middle-ground was suggested by User:Martinp23 above: dump the differing text into the talk page, with a little note in the image page itself that there's some extra info at the talk page. I think that this would do fine for image descriptions, but I'd want to put conflicting copyright tags front-and-center on the actual image page, since these should actually be acted upon.

Thanks for your comments, whoever's watching. Staecker 17:15, 19 March 2007 (UTC)[reply]

It looks like your bot is doing its job well. Good work. HighInBC (Need help? Ask me) 17:25, 19 March 2007 (UTC)[reply]
Very encouraging indeed. I agree with you about the importance of making sure that discrepancies in licences are obvious, but share your concern about pages becoming over-cluttered. Perhaps you could put, where it differs, the old file information into a hide/show box on the newer image page. This way, it's there, but not cluttering up the page. Of course, the downside is that the copyright tag problems aren't as obvious, though this should be solvable if you can put a big template on them (something indicating uncertain licensing - I'm not sure if one already exists), which would add the page to a category for sorting and fixing. Martinp23 20:52, 24 March 2007 (UTC)[reply]

Thanks for the advice. I've started putting hide/show boxes in there. See Image:668585518B m.jpg and Image:668585518 m.jpg for an example. The box comes from User:Staeckerbot/Duplicate-file-info- feel free to tweak the look and feel. Right now I'm just copying the info- I haven't yet written the code to flag different licensing tags. Staecker 20:12, 25 March 2007 (UTC)[reply]

Nice, but it is copying the speedy delete tag along with it. HighInBC (Need help? Ask me) 20:16, 25 March 2007 (UTC)[reply]
Right you are- I'll try to fix that. Staecker 20:18, 25 March 2007 (UTC)[reply]
Yes - very nice. It would help to get rid of any deletion tags, or (this may/may not work) put <includeonly>s around the whole copy/paste, which may prevent the images appearing in the deletion categories, but retain the tags for admin review. It would also be good to put a message in (or above) the header of the hide/show box to tell an admin exactly what to do (or perhaps just a wikilink to a help page). Thanks, Martinp23 23:47, 28 March 2007 (UTC)[reply]
One thing - could you get it to add the info and the deletion tag in the same edit? Makes things a bit cleaner :) Martinp23 23:50, 28 March 2007 (UTC)[reply]
I've stripped out any delete tags from the show/hide box, so it shouldn't put anything in the delete categories that doesn't belong (let me know if I'm wrong). I'm not sure exactly what you are thinking of in terms of a message to admins- you are free to edit my template at User:Staeckerbot/Duplicate-file-info. As for "same edit", it'll take a bit of reworking- in a few days maybe... Thanks for your comments. Staecker 00:52, 29 March 2007 (UTC)[reply]
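One way to do that stripping, shown here as a hedged sketch rather than the bot's actual code: a regular expression that drops simple, non-nested deletion templates from the copied wikitext before it goes into the show/hide box. The template-name prefixes in the pattern are examples, not an exhaustive list.

```python
import re

# Drop {{db-...}}, {{di-...}} and {{speedy...}}-style templates from the
# copied description text. The prefixes here are illustrative examples;
# nested templates would need real wikitext parsing rather than a regex.
DELETE_TAG = re.compile(r"\{\{\s*(?:db|di|speedy)[^{}]*\}\}\s*", re.IGNORECASE)

def strip_delete_tags(wikitext):
    """Remove simple (non-nested) deletion templates from copied wikitext."""
    return DELETE_TAG.sub("", wikitext)
```
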
"Same edit" is now done- had a few spare hours. Staecker 03:17, 29 March 2007 (UTC)[reply]
It seems that the bot is working perfectly - good work.  Approved. The bot will run with a flag - please keep the edit rate below 2 per minute until the flag is granted. Thanks, Martinp23 13:25, 8 April 2007 (UTC)[reply]


The above discussion is preserved as an archive of the debate. Please do not modify it. Subsequent comments should be made in a new section.