[erlang-patches] Add supervisor:start_child/3 to limit the number of children

Discussion:

unknown

2013-04-03 12:16:37 UTC

git fetch git://github.com/vances/otp.git supervisor_child_limit

https://github.com/vances/otp/commit/04f94f86e5495f29b61654d7f744ae3eeaca9297

With the addition of a Limit argument in start_child/3, the supervisor
will either start a child or return {error,child_limit}.

A supervisor behaviour process may have children added dynamically by
other processes. If a maximum number of children is to be maintained,
the current count could be retrieved with count_children/1 before
starting another child with start_child/2. That costs an extra
round-trip message and introduces a race condition: another process
may start a child between the count and the start.

This feature is quite useful where a fixed-size pool of processes might
otherwise be used, and it suits cases where child workers are started
with high frequency.
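The two-step approach described above (count with count_children/1, then call start_child/2) might look like the following sketch. The module, the helper start_child_if_below/3 and the idle worker are hypothetical; only the supervisor calls are standard OTP. It compares against the active entry returned by count_children/1, and the gap between the two calls is exactly where the race lives.

```erlang
-module(count_then_start).
-behaviour(supervisor).
-export([start_link/0, start_worker/0, start_child_if_below/3]).
-export([init/1]).

%% Hypothetical demo supervisor: simple_one_for_one, temporary workers.
start_link() ->
    supervisor:start_link(?MODULE, []).

init([]) ->
    {ok, {{simple_one_for_one, 10, 10},
          [{worker, {?MODULE, start_worker, []},
            temporary, brutal_kill, worker, [?MODULE]}]}}.

%% Placeholder worker that idles until it receives 'stop'.
start_worker() ->
    {ok, spawn_link(fun() -> receive stop -> ok end end)}.

%% Count first, then start. Two round trips to the supervisor, and racy:
%% another caller can start a child between count_children/1 and
%% start_child/2, so Limit may be exceeded.
start_child_if_below(Sup, ChildArgs, Limit)
  when is_integer(Limit), Limit > 0 ->
    Counts = supervisor:count_children(Sup),
    {active, Active} = lists:keyfind(active, 1, Counts),
    case Active < Limit of
        true  -> supervisor:start_child(Sup, ChildArgs);
        false -> {error, child_limit}
    end.
```

A single call that checks and starts inside the supervisor, as the patch proposes, removes both the extra round trip and the window between the two steps.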

--
-Vance

unknown

2013-04-03 12:23:31 UTC

I think that if you're going to add start_child/3, then the third
argument should be a list of options, rather than some specific thing,
making it easy to add more options in the future without having to have
start_child/4 etc.

/Richard

unknown

2013-04-03 12:39:17 UTC

On Wed, Apr 03, 2013 at 02:23:31PM +0200, Richard Carlsson wrote:
} I think that if you're going to add start_child/3, then the third
} argument should be a list of options, rather than some specific
} thing, making it easy to add more options in the future without
} having to have start_child/4 etc.

That does seem like a reasonable approach. I'll reformat my contribution
in that way if there's consensus on its merit.

--
-Vance

unknown

2013-04-03 12:24:57 UTC

Hello,
I have fetched your patch and, as this introduces a new feature, I have
assigned it to the responsible team to decide whether this is a desired behaviour.
Thanks for your contribution,

--
BR Fredrik Gustafsson
Erlang OTP Team

unknown

2013-04-03 12:29:05 UTC

Is the limit counted based on living children, or on the number of
children specifications currently active?

I am also not a fan of overloading the number '0' to mean "no limit"
instead of "0 children allowed". We have atoms, and we should make use
of them. Send in '{limited, N}' or 'unlimited', or 'infinity', or any
other token value that is 100% explicit about the intent instead of just
a '0' that people have to figure out what it means according to context
rather than what it explicitly says.

I think it would also be a good idea to check that the argument is an
integer greater than 0 (rather than merely comparing greater than 0,
which a list or a tuple also would in Erlang's term order) at the call
site, rather than way too late, within the supervisor.
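The point about the type check can be seen in a small, hypothetical module: in Erlang's standard term order every atom, tuple and list compares greater than any number, so a bare `Limit > 0` accepts values that are not limits at all, while a guard at the call site rejects them before any message is sent.

```erlang
-module(limit_guard).
-export([naive_check/1, valid_limit/1]).

%% Naive check: numbers sort before atoms, tuples and lists, so this
%% returns true for foo, {} and [] as well as for positive numbers.
naive_check(Limit) ->
    Limit > 0.

%% Stricter check suitable for the call site: only positive integers.
valid_limit(Limit) when is_integer(Limit), Limit > 0 -> true;
valid_limit(_) -> false.
```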

If I were using limits like that to add a boundary to a pool, I'd worry
about how the tracking happens. It seems to me that doing it all in the
supervisor is surprisingly inefficient compared to having, say, a fixed
pool, or a gen_server that monitors workers and starts them for you so
the counter is implicit and direct rather than linear based on the
number of children, and repeated on every single call. Just how often
are you calling for new children to be added to the pool, and why does
it happen so often as to be a problem? Does it happen frequently enough
to be a problem, but infrequently enough for the supervisor to be able
to do it without degrading its service?

I don't have specific opinions for or against the feature itself and
would defer to the OTP committee to judge its worth -- just gave your
code a quick review :)

Regards,
Fred.

unknown

2013-04-03 12:53:25 UTC

On Wed, Apr 03, 2013 at 08:29:05AM -0400, Fred Hebert wrote:
} Is the limit counted based on living children, or on the number of
} children specifications currently active?

Not living children, just child specifications. Checking for living
children is too heavyweight here. When count_children/1 was added, the
supervisor was also changed to store dynamic children in a dict() or
sets(), so checking the size of that collection is O(1). If you need to
know how many are alive, count_children/1 can be used.

} I am also not a fan of overloading the number '0' to mean "no limit"
} instead of "0 children allowed". We have atoms, and we should make use
} of them. Send in '{limited, N}' or 'unlimited', or 'infinity', or any
} other token value that is 100% explicit about the intent instead of just
} a '0' that people have to figure out what it means according to context
} rather than what it explicitly says.

I only used '0' internally; the API insists on a pos_integer(). I did
think the OTP team might want the internal interface to be backward
compatible, so I was expecting some change there.

} I think it would also be a good idea to check that the argument is an
} integer greater than 0 (rather than merely comparing greater than 0,
} which a list or a tuple also would in Erlang's term order) at the call
} site, rather than way too late, within the supervisor.

Yes, you are right about that. Noted.

} If I were using limits like that to add a boundary to a pool, I'd worry
} about how the tracking happens. It seems to me that doing it all in the
} supervisor is surprisingly inefficient compared to having, say, a fixed
} pool, or a gen_server that monitors workers and starts them for you so
} the counter is implicit and direct rather than linear based on the
} number of children, and repeated on every single call.

Not at all! My sore spot involves high-frequency spawning of short-lived
processes, and I cannot afford the sort of overheads you are suggesting.
My solution has quite low overhead; the intent is to keep it as low as possible.

} Just how often are you calling for new children to be added to the pool,
} and why does it happen so often as to be a problem? Does it happen
} frequently enough to be a problem, but infrequently enough for the
} supervisor to be able to do it without degrading its service?

Making the supervisor do it would require changing the SupSpec, which
isn't going to happen. It is also more flexible to set the limit on each
call, as it may change over time.

--
-Vance

unknown

2013-04-09 14:10:27 UTC

Hi Vance!

Is this feature mostly for use with simple_one_for_one supervisors? I
believe (possibly faulty??) that these are the most common supervisors for
which many children are repeatedly restarted as a part of normal execution.
Could you say something more about your use case? Other input on this?

Regarding the implementation, one could say that the behavior differs a
little bit for different types of supervisors and child restart types.
(This is of course due to the nature of the supervisor/child types):

* simple_one_for_one supervisor (all child restart types) -> Limit is
compared to the number of alive children.
* other supervisor type, temporary children -> Limit is compared to the
number of alive children.
* other supervisor type, non-temporary children -> Limit is compared to the
number of child specs.

The very least we need to consider is if this is the correct behavior, and
if so I think it needs to be mentioned in the documentation. Or should the
feature be restricted to simple_one_for_one supervisors (or is it only me?)?

Regards
siri

unknown

2013-04-10 12:08:12 UTC

On Tue, Apr 09, 2013 at 04:10:27PM +0200, Siri Hansen wrote:
} Is this feature mostly for use with simple_one_for_one supervisors?

My current use case scenario is, yes.

} I believe (possibly faulty??) that these are the most common supervisors for
} which many children are repeatedly restarted as a part of normal execution.

Yes, I would agree.

} Could you say something more about your use case? Other input on this?

In my case we are creating a child worker process to manage the lifecycle
of a transaction. We process thousands of transactions per second. A
transaction may take tens of milliseconds or tens of seconds. We require
a limit on the number of possible ongoing transactions.

} Regarding the implementation, one could say that the behavior differs a
} little bit for different types of supervisors and child restart types.

It differs a little bit. My description was:

"If Limit or more children are already specified for the supervisor
start_child/3 returns {error,child_limit}."

My implementation counts the child specifications.

} * simple_one_for_one supervisor (all child restart types) -> Limit is
} compared to the number of alive children.
} * other supervisor type, temporary children -> Limit is compared to the
} number of alive children.
} * other supervisor type, non-temporary children -> Limit is compared to the
} number of child specs.

Effectively yes; however, it's really just the number of child_spec()
entries, as we are not using is_process_alive/1 to check liveness. Really
it's a limit on the number of child_spec() which may exist.

} The very least we need to consider is if this is the correct behavior, and
} if so I think it needs to be mentioned in the documentation. Or should the
} feature be restricted to simple_one_for_one supervisors (or is it only me?)?

I didn't see any reason to restrict it to simple_one_for_one supervisors.
You could use supervisor:start_child/2 to add thousands of children to a
one_for_one supervisor. If you were doing that you might just want to
limit the number of such child_spec() being added.

What do you think about Richard's suggestion that start_child/3 should
take an Options::list() argument instead with {child_limit, N::pos_integer()}
as the only currently defined option?
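Combining Richard's options-list suggestion with the {child_limit, N} option named here, and Fred's preference for an explicit atom over '0', an options parser might look like this hypothetical sketch. None of it is part of OTP's supervisor API; the module and function names are illustrative only.

```erlang
-module(start_opts).
-export([parse_opts/1, below_limit/2]).

%% Hypothetical parser for the Options list of a start_child/3.
%% Only {child_limit, N} with a positive integer N is recognised;
%% anything else raises badarg. Without the option the limit is the
%% explicit atom 'infinity' rather than a magic 0.
-spec parse_opts(list()) -> pos_integer() | infinity.
parse_opts(Opts) when is_list(Opts) ->
    lists:foldl(fun({child_limit, N}, _Acc) when is_integer(N), N > 0 ->
                        N;
                   (_Other, _Acc) ->
                        erlang:error(badarg)
                end,
                infinity, Opts).

%% Numbers sort before atoms in Erlang's term order, so
%% Count < infinity holds for every integer Count.
-spec below_limit(non_neg_integer(), pos_integer() | infinity) -> boolean().
below_limit(Count, Limit) ->
    Count < Limit.
```

Using 'infinity' as the default has a convenient side effect: the same `Count < Limit` comparison works for both the limited and unlimited cases.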

--
-Vance

unknown

2013-04-10 12:51:11 UTC

Post by unknown
In my case we are creating a child worker process to manage the lifecycle
of a transaction. We process thousands of transactions per second. A
transaction may take tens of milliseconds or tens of seconds. We require
a limit on the number of possible ongoing transactions.

Have you considered using ETS counters, and possibly a monitor process?
The idea would be that, with thousands of connections, you increment an
ETS counter outside of the supervision structure.

In my experience with high-throughput or low-latency systems, what could
kill you was not necessarily how high the counter was, but how much
contention there was on it.

If you're in the kind of position where you need to limit the number of
transactions to avoid falling over, it will *not* reduce the number of
messages sent to the supervisor, and if you start going over the top,
you'll time out no matter what, just because the supervisor won't be
able to keep up with the demand.

It takes a while before reaching that level, but in these cases, what I
end up doing most of the time is holding an ETS counter that caps itself
at the given max level. Increment the counter as an
atomic operation (a write operation that also reads, so you benefit from
{read_concurrency,true} as an option). Assuming an entry of the form
{transactions, N}:

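    -spec can_start(ets:tid()) -> boolean().
    can_start(Table) ->
        %% the counter should start at 0 when initiating things
        %% application:get_env/2 returns {ok, Value}, not the bare value
        {ok, MaxValue} = application:get_env(your_app, max_trans),
        %% add 1; if the result would exceed MaxValue, store MaxValue
        %% instead; the stored value is returned
        MaxValue > ets:update_counter(Table,
                                      transactions,
                                      {2, 1, MaxValue, MaxValue}).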

Using that command, the max value will be easily configurable, will keep
a ceiling set to the max value in there, and will be much, much faster
to deny (and accept) requests while keeping your supervisor less loaded.

Now what you'll need is a monitor process that will be able to decrement
the counter for you when you're done, but only with processes that
managed to get started. The management stuff can forget all about the
processes that couldn't get in there. In practice, it works very well,
and I've used a similar architecture for dispcount
(https://github.com/ferd/dispcount), which has been used in production
for over a year for low-latency scenarios. Now dispcount uses a fixed
pool size and *is* a pool, but the same mechanisms can be applied to a
more central system where one main counter is used.
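A rough sketch of the counter plus a single monitor process, assuming one public ETS table owned by a long-lived process; module and function names are hypothetical. One deliberate swap from the clamped {2, 1, MaxValue, MaxValue} update: acquire/2 here increments unconditionally and undoes its own increment when it lands past Max. With the clamped form, a denied call still moves the counter (from MaxValue - 1 to MaxValue), and since only started processes are decremented, that bump is never undone; increment-and-undo avoids the drift at the cost of a second update on the deny path.

```erlang
-module(trans_limiter).
-export([new/0, start_watcher/1, start_monitored/4, count/1]).

%% Counter kept in its own write-heavy table. The table is owned by,
%% and dies with, the process that calls new/0.
new() ->
    Tab = ets:new(trans_limiter, [set, public, {write_concurrency, true}]),
    true = ets:insert(Tab, {transactions, 0}),
    Tab.

%% Increment; if that went past Max, undo our own increment and deny.
%% Every denied caller reverses its bump, so at most Max are admitted.
acquire(Tab, Max) ->
    case ets:update_counter(Tab, transactions, {2, 1}) of
        N when N =< Max -> true;
        _ ->
            _ = ets:update_counter(Tab, transactions, {2, -1}),
            false
    end.

release(Tab) ->
    _ = ets:update_counter(Tab, transactions, {2, -1}),
    ok.

%% One process monitors every admitted worker and frees a slot when the
%% worker goes down (normal exit, crash, or already dead: noproc).
start_watcher(Tab) ->
    spawn(fun() -> watch_loop(Tab) end).

watch_loop(Tab) ->
    receive
        {watch, Pid} ->
            _ = erlang:monitor(process, Pid),
            watch_loop(Tab);
        {'DOWN', _Ref, process, _Pid, _Reason} ->
            ok = release(Tab),
            watch_loop(Tab)
    end.

%% Only admitted workers are started and handed to the watcher;
%% rejected requests leave nothing behind to clean up.
start_monitored(Tab, Watcher, Max, Fun) when is_function(Fun, 0) ->
    case acquire(Tab, Max) of
        true ->
            Pid = spawn(Fun),
            Watcher ! {watch, Pid},
            {ok, Pid};
        false ->
            {error, limit}
    end.

count(Tab) ->
    ets:lookup_element(Tab, transactions, 2).
```

The admit/deny decision never touches the watcher or a supervisor; only the decrement is funnelled through a process.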

This will, in my experience, be more scalable as an approach than
modifying supervisors' internal state and relying on it. In the
benchmarks we ran at the job where I wrote dispcount, a single process
could chug on maybe 9000 messages a second before starting to get
swamped and using more resources than necessary (I can't remember what
hardware I used for the benchmark). Using the ETS approach on the same
hardware, I wasn't able to even get to the point where it was
problematic -- allocation of processes to generate contention and
gathering statistics turned out to be a bigger bottleneck.

That's without counting that using ETS counters, getting a response back
was a matter of microseconds, or had peak times under 5ms. Using
messages, it was very easy to see roundtrip times well above 70ms, and
those were with dedicated processes, not processes like supervisors that
also need to do a lot of other stuff.

As I said, it is more scalable and more performant. It is, however, not
available out of the box.

Regards,
Fred.

unknown

2013-04-10 13:03:37 UTC

[...] Increment the counter as an atomic operation (a write operation
that also reads, so you benefit from {read_concurrency,true} as an

Sorry, this should read {write_concurrency,true}. Using update_counter/3
is a write-only operation that both lets you change and get the value of
the counter. I tend to segregate such counters into their own table (or
use read operations very, very selectively) so as to keep the switching
between read and write modes as low as possible.

Regards,
Fred.

unknown

2013-04-11 05:47:42 UTC

On Wed, Apr 10, 2013 at 08:51:11AM -0400, Fred Hebert wrote:
} Have you considered using ETS counters, and possibly a monitor process?
...
} If you're in the kind of position where you need to limit the number of
} transactions to avoid falling over, it will *not* reduce the number of
} messages sent to the supervisor, and if you start going over the top,
} you'll time out no matter what, just because the supervisor won't be
} able to keep up with the demand.

Fred,

Your approach is quite valid; however, it addresses an issue I am not
yet considering. I am concerned not with overload protection but with
policy enforcement. The supervisor should have no more than N workers.
The correct place to address that issue is in the supervisor. True, I
could address it otherwise but I propose a small change to support this
in the OTP implementation.

The alternative solution which my coworkers have historically used is
long lived pools of processes. I believe that the Erlang way is to have
a process with a life cycle matching the transaction's. It makes me much
happier to eliminate the pools.

--
-Vance

unknown

2013-04-12 16:53:07 UTC

Hi Vance,

sorry for the delayed answer. We have had some discussions within our team
and we do find your idea interesting. We do not, however, really like the
idea of setting the limit in the call to supervisor:start_child, but rather
think it should be possible to set such a property on the supervisor or in
the child spec. The drawback of this, as you say in an earlier mail, is
that it would require changing the supervisor spec - and that is a much
bigger change. We do, however, plan to make the supervisor API a bit more
flexible, thereby adding the possibility of introducing new properties.

The timeframe for this is not yet set, and a contribution would of course
help speeding things up :) We already got a patch a good year ago (
http://erlang.org/pipermail/erlang-patches/2012-January/002574.html) but it
was never completed...

/siri

Siri Hansen

2015-01-15 15:00:34 UTC

Hi Vance,

I'm happy to announce that in OTP-18, the supervisor API will be improved
to allow supervisor flags and childspecs declared as maps (similar to
http://erlang.org/pipermail/erlang-patches/2012-January/002574.html, except
maps are used instead of proplists). This will allow for additions to these
data structures. We now wonder if the functionality handled in the current
thread is still wanted, and if so if the patch should be rewritten to use
the new API?
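For reference, an init/1 using the map-based supervisor flags and child specs described here for OTP 18 might look like this. The module, the child id and the idle worker are placeholders; there is no child-limit key in these maps, and any such key would be a new addition on top of this format.

```erlang
-module(trans_sup).
-behaviour(supervisor).
-export([start_link/0, start_worker/0]).
-export([init/1]).

start_link() ->
    supervisor:start_link(?MODULE, []).

%% OTP 18+ map form of the supervisor flags and a single child spec.
init([]) ->
    SupFlags = #{strategy => simple_one_for_one,
                 intensity => 10,
                 period => 10},
    ChildSpec = #{id => transaction,
                  start => {?MODULE, start_worker, []},
                  restart => temporary,
                  shutdown => brutal_kill,
                  type => worker,
                  modules => [?MODULE]},
    {ok, {SupFlags, [ChildSpec]}}.

%% Placeholder standing in for a per-transaction worker process.
start_worker() ->
    {ok, spawn_link(fun() -> receive stop -> ok end end)}.
```

Because unknown keys can be added to maps without breaking existing positional tuples, this format leaves room for new properties such as a limit.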

Regards
/siri

Vance Shipley

2015-01-16 11:17:50 UTC

} I'm happy to announce that in OTP-18, the supervisor API will be improved
} to allow supervisor flags and childspecs declared as maps (similar to
} http://erlang.org/pipermail/erlang-patches/2012-January/002574.html, except
} maps are used instead of proplists). This will allow for additions to these
} data structures. We now wonder if the functionality handled in the current
} thread is still wanted, and if so if the patch should be rewritten to use
} the new API?

Yes, indeed it is. I will submit a new patch.

--
-Vance

Tuncer Ayaz

2015-01-16 12:06:32 UTC

Post by Siri Hansen
I'm happy to announce that in OTP-18, the supervisor API will be
improved to allow supervisor flags and childspecs declared as maps
(similar to
http://erlang.org/pipermail/erlang-patches/2012-January/002574.html,
except maps are used instead of proplists). This will allow for
additions to these data structures.

To be totally clear, it will be optional and require 18.x, but
existing code will still work, right?

Siri Hansen

2015-01-16 15:16:08 UTC

sorry, forgot to copy the list...

---------- Forwarded message ----------
From: Siri Hansen <***@gmail.com>
Date: 2015-01-16 14:07 GMT+01:00
Subject: Re: [erlang-patches] Add supervisor:start_child/3 to limit the
number of children

Post by Tuncer Ayaz

To be totally clear, it will be optional and require 18.x, but
existing code will still work, right?

That is correct! It was merged to master on November 6th - this is the
merge commit:
https://github.com/erlang/otp/commit/eeba41c3f9a56948d62de66065851f509ae02b43
- if you want to look at the details...
/siri

Siri Hansen

2015-01-16 13:15:09 UTC

Post by Siri Hansen
I'm happy to announce that in OTP-18, the supervisor API will be
improved to allow supervisor flags and childspecs declared as maps (similar
to http://erlang.org/pipermail/erlang-patches/2012-January/002574.html,
except maps are used instead of proplists). This will allow for additions
to these data structures. We now wonder if the functionality handled in the
current thread is still wanted, and if so if the patch should be rewritten
to use the new API?
Yes, indeed it is. I will submit a new patch.

Great! As you know we need to be extra careful when including new
functionality in the OTP base, so we might need to go another round with
the OTP Technical Board. So, to avoid wasting time, it might be a good idea to
give us a short description of the design (including API) before
implementing the details and we will give you feedback as soon as possible.

/siri

unknown

2013-04-12 19:31:30 UTC

This is very easy to do with a custom gen_server, literally a few lines
of code, for the common case where the "child spec" is known in advance and
fixed, and worker processes are relatively short-lived. Simultaneously it
gives you a convenient place for some management logic, which is often handy.

I imagine the above constraints might fit your case? I can isolate a
minimal example if desired.

To be clear: by "a few lines of code" I'm referring to a version that does
spawn rate limiting, fault rate limiting and concurrency control. For cases
that need to do more involved regulation a gen_fsm works better, but the
general approach is the same.
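The custom gen_server alternative could look roughly like this hypothetical sketch. It covers only concurrency control (no spawn-rate or fault-rate limiting), and its workers are monitored rather than linked, so they sit outside any supervision tree; the module and function names are illustrative.

```erlang
-module(worker_limiter).
-behaviour(gen_server).
-export([start_link/1, run/2, active/1]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2,
         terminate/2, code_change/3]).

start_link(Max) when is_integer(Max), Max > 0 ->
    gen_server:start_link(?MODULE, Max, []).

%% Ask the limiter to run Fun in a new process if a slot is free.
run(Limiter, Fun) when is_function(Fun, 0) ->
    gen_server:call(Limiter, {run, Fun}).

%% Current number of running workers.
active(Limiter) ->
    gen_server:call(Limiter, active).

%% State is {Max, Active}. The check and the spawn happen inside one
%% process, so there is no window between counting and starting.
init(Max) ->
    {ok, {Max, 0}}.

handle_call({run, Fun}, _From, {Max, Active}) when Active < Max ->
    {Pid, _Ref} = spawn_monitor(Fun),
    {reply, {ok, Pid}, {Max, Active + 1}};
handle_call({run, _Fun}, _From, State) ->
    {reply, {error, limit}, State};
handle_call(active, _From, {_Max, Active} = State) ->
    {reply, Active, State}.

handle_cast(_Msg, State) ->
    {noreply, State}.

%% A monitored worker finished or crashed: free its slot.
handle_info({'DOWN', _Ref, process, _Pid, _Reason}, {Max, Active}) ->
    {noreply, {Max, Active - 1}};
handle_info(_Other, State) ->
    {noreply, State}.

terminate(_Reason, _State) ->
    ok.

code_change(_OldVsn, State, _Extra) ->
    {ok, State}.
```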

Just wondering if it's worth further increasing supervisor(3) complexity.

BTW The same argument and concern also applies to simple_one_for_one, or
at least to its prevalent use cases. But that's already in.

BTW For cases where child processes are spawned at high frequencies I find
it counterproductive to use supervisors -- the SASL log gets flooded with what
in practice tends to be pure noise, obscuring any important goings-on. One
more small argument for a tailor-made solution.

BR,
-- Jachym
