The following discussion is an archived debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA. The result of the discussion was Approved.

Operator: Nemo_bis for this task; Pintoch as main owner and author of the bot

Time filed: 13:52, Thursday, July 25, 2019 (UTC)

Function overview: Add and maintain supported identifiers to citation templates (mostly ((cite journal))), including related metadata such as access level but excluding the |url= parameter.

Automatic: A queue of edits is created automatically (manually triggered), then a cursory review of its contents is performed manually to exclude anomalies, then select items are moved to a queue for the bot to perform them automatically. Edits are then sampled for manual checks and some manual fixes are performed by the operators in the few hours or days following a bot run on the pages which ended up on Category:CS1 maintenance (typically less than one in a thousand).

Programming language(s): Python

Source code available: https://github.com/dissemin/oabot / phabricator:tag/oabot/ (relying on https://github.com/dissemin/dissemin/ and https://github.com/Impactstory/oadoi )

Links to relevant discussions (where appropriate): Wikipedia talk:OABOT, Help_talk:Citation_Style_1#RfC_on_linking_title_to_PMC and similar for the desirability of identifiers and precise information on them.

Edit period(s): Once every few weeks or months.

Estimated number of pages affected: Less than 20k for the first steps; more than 300k overall considering all articles with DOIs.

Namespace(s): 0

Exclusion compliant (Yes/No): Yes

Adminbot (Yes/No): No

Function details: Following the success of task OAbot 2, we're proposing to extend the functionality of the bot to all identifiers. The addition of arxiv and PMC identifiers (about 25k edits) has been a success: it has encountered few mistakes and the bot has been made more robust in response (for instance we are now stricter in matching publications).

The first step will be to add |hdl= identifiers and |hdl-access= status on about 2k articles. Those handles typically point to an institutional repository like https://ntrs.nasa.gov/ or https://deepblue.lib.umich.edu/ (the most common in the queue is https://quod.lib.umich.edu/ for now). Citation bot is also able to add such identifiers, but does so more slowly and does not (yet) set access status, while we now do (T228632): example edit [1].
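The kind of edit described above can be sketched as follows. This is a minimal illustration, not the bot's actual code: the helper name `add_template_params`, the example citation, and the handle value `2027.42/00000` are all hypothetical, and the sketch assumes a single flat template written with the usual `{{...}}` wikitext delimiters (nested templates would need a real wikitext parser).

```python
import re


def add_template_params(template: str, new_params: dict) -> str:
    """Append |name=value pairs to one {{...}} template.

    Parameters whose name is already present are left untouched, so
    running the function twice does not duplicate anything.
    Hypothetical helper; assumes no nested templates.
    """
    if not (template.startswith("{{") and template.endswith("}}")):
        raise ValueError("expected a single {{...}} template")
    body = template[2:-2]
    # Names of parameters already set, e.g. "doi" from "|doi=10.1000/x".
    existing = {m.group(1) for m in re.finditer(r"\|\s*([\w-]+)\s*=", body)}
    for name, value in new_params.items():
        if name not in existing:
            body += f" |{name}={value}"
    return "{{" + body + "}}"


# Placeholder citation and handle, for illustration only.
citation = "{{cite journal |title=Example |doi=10.1000/example}}"
updated = add_template_params(
    citation, {"hdl": "2027.42/00000", "hdl-access": "free"}
)
print(updated)
```

Note that the sketch never touches |url=, in line with the scope stated in the function overview.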

After this is done, other identifiers will be handled depending on demand and volumes. The most consequential work will be to eventually add |doi-access=free to all relevant citations (an estimated 200k DOIs): this functionality was part of the original request (and not challenged by anybody) but later dropped when the bot became a user-triggered tool, as the number of required edits is incompatible with human editing.

Expected improvements in the near future, if this task is approved, include:

Discussion

This might not be relevant yet, but I take it that the bot won't add identifiers without some kind of procedure to reject unsuitable identifiers? This bot has had some copyright issues in the past. It also won't replace already existing URLs? Because that might be problematic under WP:SAYWHEREYOUGOTIT. Jo-Jo Eumerus (talk, contributions) 16:07, 25 July 2019 (UTC)
The current procedure to reject unwanted identifiers is to either blacklist the bot on the specific page with ((bots)) or comment out the identifier in the specific citation template. The proposed additional procedure is to let any user blacklist an identifier by means of linking it on a central subpage, so that it's no longer added to any other page: this will allow users to reject one, ten or a thousand identifiers with a single edit and have the community decide it by consensus.
This task proposes that no edits are made to the |url= parameter at all using the bot account. I'll note however that WP:SAYWHERE specifically states that «You do not have to specify how you obtained and read it. So long as you are confident that you read a true and accurate copy, it does not matter [...]». Nemo 16:24, 25 July 2019 (UTC)
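The proposed central blacklist could work roughly as in the sketch below. The page format is an assumption made only for illustration (one `parameter:value` entry per line, with `#` starting a comment); the reply above says only that identifiers would be linked on a central subpage. The function names and sample identifiers are hypothetical.

```python
def parse_blacklist(page_text: str) -> set:
    """Collect one 'param:value' entry per non-empty, non-comment line.

    The line format is an assumption for this sketch, not the bot's
    actual subpage format.
    """
    entries = set()
    for line in page_text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            entries.add(line.lower())
    return entries


def filter_candidates(candidates, blacklist):
    """Drop (param, value) pairs that appear on the blacklist."""
    return [
        (param, value)
        for param, value in candidates
        if f"{param}:{value}".lower() not in blacklist
    ]


# Placeholder page content and candidate identifiers.
page = """
# identifiers rejected by consensus
hdl:2027.42/00000
"""
blacklist = parse_blacklist(page)
candidates = [("hdl", "2027.42/00000"), ("hdl", "2027.42/11111")]
allowed = filter_candidates(candidates, blacklist)
print(allowed)
```

Because the check runs before any edit is queued, a single edit to the subpage would stop the identifier from being added to any further page.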
I have no objection to adding hdl identifiers. But I am currently seeing huge numbers of OABot edits on my watchlist, making it difficult to find any other changes and impossible to manually check them for accuracy, and would be interested in knowing whether there are any plans for throttling the bot to a more reasonable rate of updates. Also, if the "other identifiers" to be added are to be included in this BRFA, they need to be specified explicitly. For instance, I would be opposed to automatically adding citeseerx identifiers, for all the previously-discussed reasons, and wouldn't want this BRFA to be taken as sidestepping that discussion. —David Eppstein (talk) 17:57, 25 July 2019 (UTC)
On the first point, I agree we need a frank conversation on the scope of the task; I just suggest we avoid having the same conversation over and over for each new identifier. On the second, as far as I can see the bot has respected the typical rate limit of 12 edits per minute, but it would not be a problem to reduce the speed. Nemo 20:06, 25 July 2019 (UTC)
I support this task, but I'll let someone else from the BAG do the review here. I'll note, however, that WP:BOTREQUIRE suggests 6 EPM for non-urgent tasks. Headbomb {t · c · p · b} 14:10, 30 July 2019 (UTC)
One thing I would like to see is that zenodo support is added to CS1 templates. Headbomb {t · c · p · b} 14:13, 30 July 2019 (UTC)
Actually, Zenodo links were the reason why I asked whether the bot won't add identifiers without some kind of procedure to reject unsuitable identifiers, as we've had copyright problems and disputes about them. I am not sure if the problem was resolved, though. Jo-Jo Eumerus (talk, contributions) 14:20, 30 July 2019 (UTC)
This request, sic stantibus rebus, would not produce any addition of links to Zenodo, as there is no identifier for it. As for the existing identifier parameters, which evidently were added because the target websites are considered good resources rather than systematic copyright infringement rackets, the proposal is the blacklist of specific URLs above. The discussions you linked were often focused on hypothetical or apodictic statements, impossible to discuss constructively; if users instead can focus on explaining which URLs are bad for which reasons, a consensus will be easier to find. Nemo 19:56, 30 July 2019 (UTC)
I pushed a change to reduce the editing speed. Nemo 19:56, 30 July 2019 (UTC)
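For reference, a throttle at the 6 edits per minute mentioned above could look like the sketch below. It is not the change that was actually pushed; the class name `EditThrottle` and its parameters are hypothetical, and the clock and sleep functions are injectable only so the logic can be checked without real waiting.

```python
import time


class EditThrottle:
    """Enforce a minimum interval between consecutive edits.

    Hypothetical helper: 6 edits per minute means at least 10 seconds
    between the start of one edit and the start of the next.
    """

    def __init__(self, edits_per_minute=6, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 60.0 / edits_per_minute
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous edit, if any

    def wait(self):
        """Block until the next edit is allowed, then record its time."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


# Intended use (save_edit is a placeholder for the bot's own save call):
#     throttle = EditThrottle(edits_per_minute=6)
#     for edit in queue:
#         throttle.wait()
#         save_edit(edit)
```

Spacing edits evenly, rather than allowing bursts of six followed by a pause, also keeps watchlists from filling up in clumps.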
Back to the CiteSeerX issue, to rekindle the discussion: in my opinion it falls squarely under Wikipedia:Copyrights#Linking_to_copyrighted_works "It is currently acceptable to link to internet archives such as the Wayback Machine, which host unmodified archived copies of webpages taken at various points in time" for the cached PDFs, while the rest of the functions (citation graphs etc.) are uncontroversially helpful and unproblematic. Therefore the current policies support an automatic addition and we should only handle the rare exceptions where a link would be problematic: a blacklist is a possible technical solution, but we could consider other ideas. Nemo 17:08, 24 August 2019 (UTC)

Sounds great, thanks for the update. – SJ + 18:38, 16 August 2019 (UTC)

The above discussion is preserved as an archive of the debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA.