The following discussion is an archived debate. Please do not modify it. Subsequent comments should be made in a new section. The result of the discussion was Approved.

Operator: Wronkiew (talk)

Automatic or Manually Assisted: Automatic

Programming Language(s): PHP

Function Summary: Update Wikipedia:DUSTY

Edit period(s) (e.g. Continuous, daily, one time run): Daily

Already has a bot flag (Y/N): N

Function Details: The list of dusty pages, linked from SpecialPages Maintenance reports, is several months out of date. The list should be regenerated as people update the pages. Practically, this could be done once per day. DustyBot will do this in two stages. The first stage will generate a list of ~10,000 dusty pages from the most recent database dump. This requires tens of thousands of page accesses to find and discard disambiguation pages. Fortunately, this only needs to be done when a new database dump is available, which happens once every couple of months. The list will be built over the course of several days, keeping page accesses below 10/min. The second stage will scan this list once per day for the first 100 pages that are still dusty, and post that list at Wikipedia:Dusty articles. Because this bot will only edit Wikipedia once per day, and will only change one hard-coded page, the risk of interfering with other editors is low. I am interested in hearing ideas about how to reduce the number of page accesses.
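The second stage described above can be sketched roughly as follows. This is an illustration in Python, not DustyBot's PHP source. The names `select_dusty` and `current_last_edit` are hypothetical, as is the assumption that a page counts as "still dusty" when its live last-edit timestamp is no newer than the one recorded in the dump. The 6-second default pause mirrors the 10 accesses/min ceiling stated for the first stage; applying it to the second stage is also an assumption.

```python
import time
from typing import Callable, Iterable, List, Optional, Tuple


def select_dusty(
    candidates: Iterable[Tuple[str, int]],
    current_last_edit: Callable[[str], Optional[int]],
    wanted: int = 100,
    min_interval: float = 6.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Return up to `wanted` titles that have not been edited since the dump.

    candidates: (title, last-edit timestamp from the dump), oldest first.
    current_last_edit: hypothetical lookup returning the live last-edit
        timestamp for a title, or None if the page no longer exists.
    """
    dusty: List[str] = []
    for title, dump_ts in candidates:
        if len(dusty) >= wanted:
            break  # stop early so no page is accessed unnecessarily
        live_ts = current_last_edit(title)  # one page access per candidate
        if live_ts is not None and live_ts <= dump_ts:
            dusty.append(title)  # untouched since the dump: still dusty
        sleep(min_interval)  # throttle between page accesses
    return dusty
```

Because the scan stops as soon as 100 dusty pages are found, the daily cost depends on how many candidates near the top of the list have been edited recently, not on the full ~10,000 entries.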

Discussion

Which db dump are you using? If the dump doesn't contain the page text you should be able to generate lists of disambig pages from the templates at MediaWiki:Disambiguationspage using the API or ask someone with toolserver access to do a query. Also I'm somewhat confused, it "will generate a list of ~10,000 dusty pages" then scan "for the 100 pages that are still dusty." What will it actually be reporting on Wikipedia? Mr.Z-man 07:24, 13 October 2008 (UTC)

I'm using page.sql.gz from the 10/08 dump. I could process pages-articles.xml.bz2 instead, which would eliminate the need to check individual pages, but that would mean downloading a 4 GB file instead of tens or hundreds of MB. The list of 10,000 potentially dusty pages is just the first stage and is not posted to Wikipedia. The pages on that list are either really dusty or have been very recently edited. The second stage goes through that list, weeding out the recently updated ones, until it has a list of 100 pages. That list of checked pages will be posted to Wikipedia. Wronkiew (talk) 15:59, 13 October 2008 (UTC)
Okay, never mind the list of templates, you should just be able to generate a list from Category:All disambiguation pages. It currently has 361,239 pages, so at 5000 pages per request, it should only take ~29 API requests (the counts aren't always accurate). You can also ask someone to do a query on the toolserver database, or use the categorylinks.sql.gz dump. Mr.Z-man 20:47, 13 October 2008 (UTC)
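A minimal sketch of the category download suggested here, written in Python for illustration (the bot itself is PHP). It pages through `list=categorymembers` and follows the `continue` block that the present-day MediaWiki Action API returns; the API as it existed in 2008 used a different continuation format. The `fetch` callable is a hypothetical stand-in for whatever HTTP client sends the parameters to the wiki's `api.php` and returns the parsed JSON response. A `cmlimit` of 5000 is only honoured for accounts with high API limits, such as flagged bots; other accounts are capped at 500 per request.

```python
from typing import Callable, Dict, Iterator


def category_members(
    fetch: Callable[[Dict[str, str]], dict],
    category: str,
    limit: int = 5000,
) -> Iterator[str]:
    """Yield the titles of all pages in `category`, one API request per batch."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": category,
        "cmlimit": str(limit),
        "format": "json",
    }
    cont: Dict[str, str] = {}
    while True:
        # Merge the continuation values from the previous response, if any.
        data = fetch({**params, **cont})
        for member in data["query"]["categorymembers"]:
            yield member["title"]
        if "continue" not in data:
            break  # no continuation block means the last batch was reached
        cont = data["continue"]
```

Collecting the result into a set, for example `set(category_members(fetch, "Category:All disambiguation pages"))`, gives a local lookup table, so candidate pages can be checked against it without a page access each.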
Thank you, your advice to download the category with API requests was very helpful. I have manually updated Wikipedia:DUSTY with the generated list. Wronkiew (talk) 16:29, 14 October 2008 (UTC)
Also, I noticed that there is a date in the table at Maintenance reports that will need to be updated by DustyBot. Wronkiew (talk) 16:33, 14 October 2008 (UTC)
Approved for trial (5 days). Please provide a link to the relevant contributions and/or diffs when the trial is complete. Whenever you're ready. Mr.Z-man 23:21, 16 October 2008 (UTC)
Ready to start testing. Wronkiew (talk) 04:38, 17 October 2008 (UTC)
Trial complete. DustyBot is disabled. Wronkiew (talk) 17:51, 21 October 2008 (UTC)
What is the point of edits to the /Updated page? BJTalk 03:25, 25 October 2008 (UTC)
It is transcluded into Wikipedia:DUSTY and Wikipedia:Maintenance#Reports. DustyBot updates that page to avoid edit conflicts and page parsing. Wronkiew (talk) 03:32, 25 October 2008 (UTC)

 Approved. BJTalk 03:33, 25 October 2008 (UTC)

The above discussion is preserved as an archive of the debate. Please do not modify it. Subsequent comments should be made in a new section.