The following discussion is an archived debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA. The result of the discussion was Approved.

Operator: Primefac

Time filed: 22:22, Thursday, February 1, 2018 (UTC)

Automatic, Supervised, or Manual: automatic

Programming language(s): AWB

Source code available: WP:AWB

Function overview: Fix broken URLs

Links to relevant discussions (where appropriate): BOTREQ request

Edit period(s): one time run

Estimated number of pages affected: ~1500

Namespace(s): All

Exclusion compliant (Yes/No): Yes

Function details: www.cwgc.org has changed their URL parameters, leaving a lot of pages with broken links. Simple find/replace:

Discussion

etc. A full list of archive types and URLs is at WP:WEBARCHIVES. -- GreenC 03:13, 2 February 2018 (UTC)
Good point. I've amended my code to include a lookbehind. Primefac (talk) 03:22, 2 February 2018 (UTC)
Would it overlap with template instances like |url=http://www.cwgc.org/find-war-dead/casualty/9898? -- GreenC 04:10, 2 February 2018 (UTC)
There are 5 instances out of over 1000 where the archive URL includes "?url=", so assuming that I remove those five from the list there shouldn't be an issue. I actually thought it would be the other way around... Primefac (talk) 04:15, 2 February 2018 (UTC)
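A minimal sketch of the lookbehind idea from this exchange, written in Python rather than AWB's .NET regex. It is not the operator's actual rule: the find/replace pair is not preserved in this archive, so the old and new URL shapes below are assumptions for illustration, and `fix_links` is a hypothetical helper name. The two lookbehinds skip URLs embedded in Wayback-style archive links (preceded by "/") and WebCite-style links (preceded by "?url="), while still matching template parameters such as |url=.

```python
import re

# ASSUMPTION: the old "aspx" URL shape and the new URL shape are illustrative;
# the archived BRFA does not record the real find/replace pair.
OLD_URL = re.compile(
    r"(?<!/)"        # not embedded in an archive URL such as .../web/2013/http://...
    r"(?<!\?url=)"   # not embedded in a "?url=" archive query; "|url=" still matches
    r"https?://www\.cwgc\.org/search/casualty_details\.aspx\?casualty=(\d+)",
    re.IGNORECASE,
)
NEW_URL = r"https://www.cwgc.org/find-war-dead/casualty/\1/"


def fix_links(wikitext: str) -> str:
    """Rewrite old-style casualty links, leaving archive URLs untouched."""
    return OLD_URL.sub(NEW_URL, wikitext)
```

Because the "?url=" lookbehind is five characters wide, it can tell a WebCite query apart from a |url= template parameter, which a single-character check on "=" could not.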
Yeah, those are WebCite, not too common. The domain was whitelisted by IABot at some point, so it hasn't been auto-archived (in the wiki anyway), which turned out to be a good thing. The 1000 with |url= might also have |deadurl=yes, and ideally it would be set back to |deadurl=no, but understandably that would be a more complex bot and not crucial. IABot might be able to detect and make the change, not sure. @Cyberpower678: -- GreenC 04:55, 2 February 2018 (UTC)
IABot can, but won't. CYBERPOWER (Chat) 22:52, 2 February 2018 (UTC)
The original request that I made didn't consider web archive URLs. If those instances need fixing manually before any bot run, I am happy to do that. Some URLs are just wrong (for various reasons), and will need manual checking to correct the ID numbers used. If it is possible to output a list of URLs that still return 404 errors, even after this correction is applied, that would help immensely. It is important to limit this to the casualty and cemetery URLs only, not the other URLs from the CWGC site (many of which are also broken). The web archive links at Cemeteries and crematoria in Brighton and Hove give good examples of how the appearance of the CWGC pages has changed from 2013 to the present: 1 vs 2 and 3 vs 4. For the appearance in 2011, see the archived link at Percy Charles Pickard: 5. Carcharoth (talk) 12:35, 2 February 2018 (UTC)
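The requested list of links that still return 404 after the correction could be produced along these lines. This is only a sketch under stated assumptions: the in-scope path patterns (in particular the cemetery path) are guesses, `still_broken` and `http_status` are hypothetical names, and nothing here reflects what the bot actually did. The status lookup is passed in as a parameter so the filter can be exercised without touching the network.

```python
import re
import urllib.error
import urllib.request
from typing import Callable, Iterable, List

# ASSUMPTION: only these two URL families are in scope; the cemetery path
# is a guess, the casualty path follows the example quoted in the thread.
IN_SCOPE = re.compile(
    r"^https?://www\.cwgc\.org/"
    r"(?:find-war-dead/casualty|find-a-cemetery/cemetery)/\d+",
    re.IGNORECASE,
)


def http_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code for a HEAD request to url."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses arrive as exceptions; network failures
        # (URLError) are deliberately left to propagate.
        return err.code


def still_broken(
    urls: Iterable[str],
    status_of: Callable[[str], int] = http_status,
) -> List[str]:
    """Keep only in-scope casualty/cemetery URLs whose status is 404."""
    return [u for u in urls if IN_SCOPE.match(u) and status_of(u) == 404]
```

Other CWGC URLs are filtered out before any request is made, which matches the request to limit the check to casualty and cemetery links.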
Couple more points. For some reason I don't understand, the bottom 44 or so links here are lacking any ID number at all. Those will need to be fixed manually. I also checked the 'https' links. Respective numbers are 163, 0, 51, 0, i.e. none of the 'aspx' links are https, and only a few of the correct ones are https. Carcharoth (talk) 21:02, 2 February 2018 (UTC)
If they don't have a numerical value in the URL, then they won't be picked up. The regex is only looking for digits. Primefac (talk) 15:08, 3 February 2018 (UTC)
I have checked the list of incorrect URLs and those all have a numerical value in the URL. The ones with the numerical values missing are in the 'correct' form - I will fix them manually. Carcharoth (talk) 14:49, 4 February 2018 (UTC)
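To illustrate why links without an ID are never picked up by a digits-only pattern, here is a small Python sketch that splits a list of links into those carrying a numeric ID and those needing manual attention. The path segments in the pattern and the name `partition_by_id` are assumptions for illustration, not the operator's regex.

```python
import re
from typing import Iterable, List, Tuple

# ASSUMPTION: the ID is a run of digits directly after a "casualty" or
# "cemetery" path segment; the real rule may differ.
HAS_ID = re.compile(r"/(?:casualty|cemetery)/\d+")


def partition_by_id(urls: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Split URLs into (with numeric ID, without numeric ID)."""
    with_id: List[str] = []
    without_id: List[str] = []
    for url in urls:
        # A digits-only pattern cannot match a URL that has no ID at all.
        (with_id if HAS_ID.search(url) else without_id).append(url)
    return with_id, without_id
```

The second list corresponds to the kind of ID-less links that would be left for manual fixing.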
The above discussion is preserved as an archive of the debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA.