The contents of the Mish (function) page were merged into Rectifier (neural networks) on 29 July 2022. For the contribution history and old versions of the redirected page, please see its history; for the discussion at that location, see its talk page.

This is the talk page for discussing improvements to the Rectifier (neural networks) article.
This is not a forum for general discussion of the article's subject.

Unjustified Claim That is Central to Topic

In the "Advantages" section, the article states that "Rectified linear units, compared to sigmoid function or similar activation functions, allow faster and effective training of deep neural architectures on large and complex datasets." This is quite a bold claim, and no justification is given. I don't think it is incorrect (in fact, I think it is correct), but evidence would be very helpful. — Preceding unsigned comment added by Gsmith14 (talk • contribs) 00:55, 8 August 2022 (UTC)

Untitled

Use of the softplus function as an approximator of the rectifier function is not warranted by any of the current four references. Note also that the softplus is a fairly bad approximation for values roughly between -2 and 2, while in general we can expect such small values. I propose to remove the section about the softplus. Angelorf (talk) 09:41, 2 July 2013 (UTC)

See page 4 of Glorot, Bordes and Bengio: "Rectifier and softplus activation functions. The second one is a smooth version of the first.". QVVERTYVS (hm?) 10:45, 2 July 2013 (UTC)
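To put a number on the size of the mismatch discussed above, here is a minimal sketch that compares softplus, log(1 + exp(x)), with the rectifier max(0, x). The function names are illustrative only. The gap equals log(1 + exp(-|x|)), so it peaks at ln 2 (about 0.693) at x = 0 and shrinks symmetrically on both sides.

```python
import math

def relu(x):
    """Rectifier: max(0, x)."""
    return max(0.0, x)

def softplus(x):
    """Numerically stable form of log(1 + exp(x))."""
    return max(0.0, x) + math.log1p(math.exp(-abs(x)))

def gap(x):
    """Difference softplus(x) - relu(x); equals log(1 + exp(-|x|))."""
    return softplus(x) - relu(x)

# Print the gap at a few points inside and outside the [-2, 2] range.
for x in (-4.0, -2.0, 0.0, 2.0, 4.0):
    print(f"x={x:+.1f}  relu={relu(x):.4f}  softplus={softplus(x):.4f}  gap={gap(x):.4f}")
```

Whether a gap of this size matters in practice depends on the scale of the pre-activations, which this sketch does not address.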

For a single input, the softplus acts as a smoothed version of the ReLU, but not for multiple inputs, if defined as in the text:

log(1 + exp(x + y)) ≠ log(1 + exp(x) + exp(y)).

"SmoothReLU" is defined by Intel with the same formula, but probably behaves more like a ReLU for multiple inputs, assuming the inputs are summed before going through the nonlinearity? It doesn't explicitly say so, though. Also, the only Google Scholar results for "SmoothReLU" are [1] and [2], which both are referencing [3], which does not use the exact term and uses a different but very similar function: "We adopted instead a smoothed version of the rectifier nonlinearity which is differentiable everywhere and still 0 when the input is negative f(x) = max(0, x − a tanh x/a)" — Omegatron (talk) 01:59, 4 December 2018 (UTC)
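The inequality above can be checked numerically. This is a sketch under the assumption that the multi-input definition in question is log(1 + Σ exp(x_i)); the helper names are hypothetical.

```python
import math

def softplus(x):
    """Single-input softplus, log(1 + exp(x)), in a numerically stable form."""
    return max(0.0, x) + math.log1p(math.exp(-abs(x)))

def softplus_multi(xs):
    """Assumed multi-input form: log(1 + sum_i exp(x_i))."""
    return math.log1p(sum(math.exp(x) for x in xs))

x, y = 1.0, 2.0
summed_first = softplus(x + y)      # log(1 + exp(x + y))
multi = softplus_multi([x, y])      # log(1 + exp(x) + exp(y))
print(summed_first, multi)
```

For large positive inputs the multi-input form grows like max(x, y), as a log-sum-exp does, while the summed-first form grows like x + y. Only the latter behaves like a ReLU applied to a sum of inputs.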

Propose removing Stub status

This article was marked as a stub in 2012. It seems to have enough information to no longer be considered a stub. I propose removing the "stub" marking at the end of the article. Cajunbill (talk) 07:53, 20 March 2015 (UTC)

Done, thanks --mcld (talk) 08:51, 6 November 2015 (UTC)

Image is really really confusing when you look closely at 0

I read that the function is not differentiable at 0, which confused me as I was looking at the image. Then I read the actual function max(0, x) and realized that the image is flawed. Please upload a non-flawed image. — Preceding unsigned comment added by 24.4.21.209 (talk) 00:17, 24 August 2015 (UTC)

Fixed, thanks --mcld (talk) 14:27, 14 December 2015 (UTC)

Recent undos

@User:Ita140188 why did you revert my edit? I just changed one word to clarify this misleading sentence:

"Non-differentiable at zero; however it is differentiable anywhere else, and a ~~value~~ slope of 0 or 1 can be chosen arbitrarily to fill the point where the input is 0."

This sentence is wrong. You cannot choose the value because it's defined by the function. The value at $x=0$ must always be zero, because this follows from the definition $f(x)=\max(0,x)\Rightarrow f(0)=0$

What's wrong with my edit? Some explanation would be nice. --2003:CB:770E:3983:F88F:E5D4:E07C:7E39 (talk) —Preceding undated comment added 14:14, 2 April 2019 (UTC)

The sentence is about the derivative of the function. This function is not differentiable at 0. For values less than 0 the derivative is 0, for positive values it is 1, so for zero one can choose 0 or 1. --Ita140188 (talk) 14:21, 2 April 2019 (UTC)

But that's misleading because it's not clear whether you are talking about the derivative or the function itself: "Non-differentiable at zero (the function!); however it (the function!) is differentiable anywhere else, and a value (of the function or of the derivative?) ...". That is unclear because the first part is about the function and the fact that it is not differentiable at zero, so "function" plus "value" reads as "value of the function". I hope you get what I mean.
I think it would be better to write "the value of the derivative" or just "slope". --2003:CB:770E:3983:F88F:E5D4:E07C:7E39 (talk) 14:37, 2 April 2019 (UTC)

I tried to make the text more clear. "Slope" is not accurate and should not be used in a technical article. --Ita140188 (talk) 16:42, 2 April 2019 (UTC)

The term "slope" is accurate: pretty much every math book I've seen that describes derivatives in English uses exactly that term.

The WP:TECHNICAL guideline encourages us to use such common terms instead of technical terms where they are equivalent.

Thank you for continuing to try to make Wikipedia both technically accurate and also easier to understand. --DavidCary (talk) 21:03, 15 November 2021 (UTC)
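The distinction argued over in this thread can be made concrete with a minimal sketch. The function value at 0 is fixed by the definition, while the derivative (slope) at 0 is a matter of convention because the one-sided slopes disagree. The names `relu_grad`, `grad_at_zero`, and `one_sided_slopes` are illustrative, not taken from any library.

```python
def relu(x):
    """f(x) = max(0, x); the value f(0) = 0 follows from the definition."""
    return max(0.0, x)

def relu_grad(x, grad_at_zero=0.0):
    """Derivative of relu; at x == 0 a conventional value (0 or 1) is returned."""
    if x > 0:
        return 1.0
    if x < 0:
        return 0.0
    return grad_at_zero  # a choice, not something forced by the function

def one_sided_slopes(f, x, h=1e-6):
    """Left and right finite-difference slopes of f at x."""
    left = (f(x) - f(x - h)) / h
    right = (f(x + h) - f(x)) / h
    return left, right

left, right = one_sided_slopes(relu, 0.0)
print(relu(0.0), left, right)
```

The left slope comes out as 0 and the right slope as 1, which is why either value can be assigned to the derivative at zero.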

The Switch Viewpoint

The ReLU can be viewed as a switch rather than an activation function. For a particular input to the neural network, each switch is then in a particular state, thrown or not thrown, and weighted sums connect together (or not) in certain ways. A weighted sum of weighted sums is still a linear system. Hence, for a particular input, there is a particular linear projection from the input to the output, and it also holds within a neighborhood around that input in which no switch changes state. A ReLU neural network is then a system of switched linear projections. Since ReLU switches at zero, there is no sudden discontinuity in the output as the input changes gradually, despite switches being definitely thrown on or off.

For a particular output neuron and a particular input, there is a linear composite of weighted sums giving the output value. Those multiple weighted sums can be combined into a single weighted sum. You could then see what that single weighted sum was looking at in the input. There are a number of metrics you can look at, such as the angle between the input vector and the weight vector of the single weighted sum. S. O'Connor — Preceding unsigned comment added by 113.190.132.240 (talk) 16:20, 25 September 2019 (UTC)

Further information and examples: https://ai462qqq.blogspot.com/2019/11/artificial-neural-networks.html For example, the dot products being switched need not be simple weighted sums; they could derive from fast transforms like the FFT. S. O'Connor — Preceding unsigned comment added by 14.162.218.184 (talk) 08:42, 8 April 2020 (UTC)
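The switch viewpoint can be illustrated with a small sketch. It assumes a hypothetical 2-3-1 network without biases, with arbitrary weights `W1` and `W2`. For a given input, the on/off states of the ReLUs select rows of `W1`, and the whole network collapses to a single matrix that reproduces the output for that input and for nearby inputs with the same switch states.

```python
def matvec(W, v):
    """Matrix-vector product for nested lists."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def matmul(A, B):
    """Matrix-matrix product for nested lists."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

# Hypothetical weights, chosen only for illustration.
W1 = [[1.0, -2.0], [0.5, 1.0], [-1.0, 0.5]]   # hidden layer, 3 units
W2 = [[1.0, -1.0, 2.0]]                        # output layer, 1 unit

def forward(x):
    """Ordinary forward pass: W2 * relu(W1 * x)."""
    hidden = [max(0.0, h) for h in matvec(W1, x)]
    return matvec(W2, hidden)

def switch_states(x):
    """1.0 where a hidden unit is 'on' (pre-activation > 0), else 0.0."""
    return [1.0 if h > 0 else 0.0 for h in matvec(W1, x)]

def effective_matrix(x):
    """Single linear map equivalent to the network for this switch pattern."""
    states = switch_states(x)
    # Zero out the rows of W1 whose switch is off, then compose with W2.
    masked_W1 = [[s * w for w in row] for s, row in zip(states, W1)]
    return matmul(W2, masked_W1)

x = [1.0, 1.0]
M = effective_matrix(x)
print(forward(x), matvec(M, x))
```

The rows of `M` are the "single weighted sums" mentioned above, so quantities such as the angle between an input and a row of `M` can be computed from it directly.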

invention of the relu?

Currently the article focuses on the publication of Hahnloser as the earliest usage of the ReLU, but should we also mention Fukushima, who already used this activation function in the "Neocognitron"-network about twenty years earlier? (see eq (2) https://www.rctn.org/bruno/public/papers/Fukushima1980.pdf and also the discussion in https://stats.stackexchange.com/questions/447674) --Feudiable (talk) 14:38, 29 May 2020 (UTC)

Signed min and signed max.

In the field of real-time rendering of signed distance fields (SDFs), there is the concept of signed max and signed min. It is an operator, or really just a function, that smoothly blends two values, ensuring both a smooth transition between them and a continuous first derivative. It is usually defined as $\textrm{smin}_{k}(a,b)=\min(a,b)-\max(0,k-|a-b|)^{2}/(4k)$, where $k$ is a smoothing factor. It has the properties $\textrm{smin}_{k}(a,b)=a\,\textrm{ for }\,a<b-k$, $\textrm{smin}_{k}(a,b)=a\,\textrm{ for }\,b>a+k$, $\lim_{k\to 0}\textrm{smin}_{k}(a,b)=\min(a,b)$ (uniformly for all $a$ and $b$), and equal left and right derivatives at the point $a=b$. A similar function can be defined for the maximum: $\textrm{smax}_{k}(a,b)=\max(a,b)+\max(0,k-|a-b|)^{2}/(4k)$. A RELU can be easily implemented efficiently using the $\textrm{smin}$ function, as $\textrm{RELU}(x)=\textrm{smin}_{1}(0,x)$. However, this might not be suitable for practical uses, because this function is 0, and has all derivatives 0, for x < -k. So many solvers that use derivatives will not work too well. 81.6.34.172 (talk) 17:05, 31 May 2020 (UTC)
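The two blending functions can be sketched directly from the formulas above, using the same names and a positive smoothing factor k. With these definitions, it appears to be smax_k(0, x), not smin_k(0, x), that tracks max(0, x). It coincides with the ReLU whenever |x| ≥ k and rounds off the corner in between, whereas smin_k(0, x) tracks min(0, x).

```python
def smin(a, b, k):
    """Smooth minimum: min(a, b) - max(0, k - |a - b|)^2 / (4k), with k > 0."""
    h = max(0.0, k - abs(a - b))
    return min(a, b) - h * h / (4.0 * k)

def smax(a, b, k):
    """Smooth maximum: max(a, b) + max(0, k - |a - b|)^2 / (4k), with k > 0."""
    h = max(0.0, k - abs(a - b))
    return max(a, b) + h * h / (4.0 * k)

def smooth_relu(x, k=1.0):
    """Smoothed rectifier built from smax; equals max(0, x) when |x| >= k."""
    return smax(0.0, x, k)

for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(x, smooth_relu(x), max(0.0, x))
```

As the comment above notes, the output and its derivatives are exactly zero for x ≤ -k, so gradient-based methods receive no signal in that region, just as with the plain ReLU.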

Biological plausibility and non sequitur

I've removed the following bullet, which was listed as an "advantage" of ReLU:

Biological plausibility: One-sided, compared to the antisymmetry of tanh.^{[non sequitur]}

An unbounded function is certainly not biologically plausible (the firing rate of a neuron has an upper limit). And the comment about tanh was marked as a non sequitur by someone last October. So it looks useful to delete this bullet.

Is ReLu ever more than an activation function?

I removed the following text from the article:

A unit employing the rectifier is also called a rectified linear unit (ReLU).^[1]

I originally had replaced it with this text, but then I chose to remove the paragraph entirely:

Originally, the term rectified linear unit (ReLU) referred to both the linear (fully-connected) layer and the activation function together,^[1] but it has become common to refer to just the activation function as the ReLU.^[2]^[3]

I'm not sure what the author of the original text meant by a "unit." When I read their reference, I could not discern whether it meant an activation function or a connection function (such as a fully-connected or convolutional function) followed by an activation function. Does anyone have expertise on this? Thank you! --Yoderj (talk) 18:32, 8 April 2021 (UTC)

References

^ ^a ^b Rectified Linear Units Improve Restricted Boltzmann Machines (PDF). ICML. 2010.
^ Cite error: The named reference brownlee was invoked but never defined (see the help page).
^ Cite error: The named reference medium-relu was invoked but never defined (see the help page).

Graphs of Each Function

It would be nice to have an image showing a graph of each function inline with the equations. --157.131.95.172 (talk) 21:25, 20 July 2021 (UTC)

Sparse activation is an advantage?

"Sparse activation: For example, in a randomly initialized network, only about 50% of hidden units are activated (have a non-zero output)."

Isn't this a disadvantage? Half the neurons are wasted and computed for no reason, and contribute nothing to the output, making the model less accurate? — Omegatron (talk) 18:59, 25 September 2023 (UTC)
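The "about 50%" figure in the quoted sentence can be reproduced with a rough simulation. This sketch assumes zero-mean Gaussian weights, zero biases, and zero-mean Gaussian inputs. The layer sizes and the function name `active_fraction` are arbitrary choices for illustration, not taken from the article.

```python
import random

def active_fraction(n_inputs=20, n_hidden=200, n_samples=200, seed=0):
    """Fraction of (sample, hidden unit) pairs whose ReLU output is non-zero."""
    rng = random.Random(seed)
    # Randomly initialized layer: symmetric zero-mean weights, no biases.
    W = [[rng.gauss(0.0, 1.0) for _ in range(n_inputs)] for _ in range(n_hidden)]
    active = 0
    for _ in range(n_samples):
        x = [rng.gauss(0.0, 1.0) for _ in range(n_inputs)]
        for row in W:
            pre_activation = sum(w * xi for w, xi in zip(row, x))
            if pre_activation > 0:   # the ReLU passes only positive values
                active += 1
    return active / (n_hidden * n_samples)

frac = active_fraction()
print(f"active fraction: {frac:.3f}")
```

Note that which units are active differs from input to input, so a unit that is off for one sample is not necessarily wasted overall. The simulation says nothing about accuracy, only about the fraction of non-zero outputs at initialization.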

Another variant function

The following function could work also:

$\ln(1+e^{x})\approx \begin{cases}\dfrac{x}{1-e^{-x/\ln(2)}},\quad x\neq 0\\ \ln(2),\quad x=0\end{cases}$

You could see its basic properties here

Maybe someone more experienced in ReLu could add it. 45.181.122.234 (talk) 22:41, 26 March 2024 (UTC)
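For anyone who wants to check the proposed approximation numerically, here is a minimal sketch that evaluates it next to softplus on a grid. The function name `proposed` and the grid range [-10, 10] are assumptions made for illustration.

```python
import math

LN2 = math.log(2.0)

def softplus(x):
    """Numerically stable form of ln(1 + e^x)."""
    return max(0.0, x) + math.log1p(math.exp(-abs(x)))

def proposed(x):
    """x / (1 - exp(-x / ln 2)) for x != 0, and ln 2 at x == 0."""
    if x == 0.0:
        return LN2
    # -expm1(-t) equals 1 - exp(-t) but is more accurate for small t.
    # Note: math.exp-style overflow occurs for very negative x (below
    # roughly -490); this sketch does not guard against that.
    return x / -math.expm1(-x / LN2)

# Largest absolute difference on a grid from -10 to 10 in steps of 0.01.
grid = [i / 100.0 for i in range(-1000, 1001)]
max_err = max(abs(proposed(x) - softplus(x)) for x in grid)
print(f"max |proposed - softplus| on [-10, 10]: {max_err:.4f}")
```

Both functions equal ln 2 at x = 0, tend to x for large positive x, and tend to 0 for large negative x, which is consistent with the approximation claimed above.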

"Rectifier (neural networks" listed at Redirects for discussion

The redirect Rectifier (neural networks has been listed at redirects for discussion to determine whether its use and function meets the redirect guidelines. Readers of this page are welcome to comment on this redirect at Wikipedia:Redirects for discussion/Log/2024 April 9 § Rectifier (neural networks until a consensus is reached. Utopes (talk / cont) 01:49, 9 April 2024 (UTC)