On our ASP server, we offer our customers access over ssh.
Each Linux login has one or more fixed users on QM, and a bash login
procedure connects them directly to QM, for example with
qm -AACCOUNT -12.
We have keepalive enabled both on the ssh server and on the client
side in the emulator. It works well in 95% of cases, but
sometimes lines get stuck when the network link breaks.
This morning 3 lines were broken in QM and the only solution was to
restart the whole of QM. Here is some information:
We have seen something similar where QM is not notified of loss of the network connection and the process hangs inside a Linux library call where we cannot see the logout request.
Although we need a better solution for this, you should be able to kill the QM processes from Linux rather than doing a complete restart. Our cleanup mechanism will then recover the licences within five minutes. You can speed this up by running qm -cleanup.
I have forwarded your email to one of our dealers who has identified a problem in Linux ssh that might explain this.
Martin Phillips Ladybridge Systems Ltd 17b Coldstream Lane, Hardingstone, Northampton, NN4 6DB +44-(0)1604-709200
So I should just kill -9 the qm process on Linux and then run qm -cleanup?
On 5 Oct, 17:38, Cedric Fontaine <cfonta...@spidmail.net> wrote:
> So I should just kill -9 the qm process on linux and then qm -cleanup ?
Although use of kill -9 is not a good idea in most situations, it
should be safe to do when QM is not responding to other termination
requests.
We do need to understand this problem more fully and come up with a
solution though all the evidence we have so far suggests that the hang
is deep inside the Linux networking system and hence outside of our
control.
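As a rough illustration of the "other termination requests first, kill -9 last" advice above, here is a minimal Python sketch. The helper names `pid_alive` and `stop_process` and the 10-second grace period are assumptions made for this example, not part of QM. It sends SIGTERM, waits, and only falls back to SIGKILL (the equivalent of kill -9) if the process is still there.

```python
import os
import signal
import time


def pid_alive(pid):
    """Return True if a process with this PID still exists."""
    try:
        os.kill(pid, 0)           # signal 0 only checks for existence
    except ProcessLookupError:
        return False
    except PermissionError:
        return True               # exists, but belongs to another user
    return True


def stop_process(pid, grace=10.0):
    """Ask a process to terminate, then force it if it does not respond.

    Returns "gone" if the PID did not exist, "terminated" if SIGTERM was
    enough, and "killed" if SIGKILL had to be sent.
    """
    if not pid_alive(pid):
        return "gone"
    try:
        os.kill(pid, signal.SIGTERM)
    except ProcessLookupError:
        return "gone"
    deadline = time.monotonic() + grace
    while time.monotonic() < deadline:
        if not pid_alive(pid):
            return "terminated"
        time.sleep(0.2)
    try:
        os.kill(pid, signal.SIGKILL)   # last resort, same as kill -9
    except ProcessLookupError:
        return "terminated"            # it exited just before the kill
    return "killed"
```

The PID of the stuck session would have to be found first, for example with ps. After the process has gone, qm -cleanup can be run as described earlier in the thread to recover the licence without waiting for the automatic cleanup.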
I can see it now - Cedric files a support request with his Linux provider (because we all know he's paying for support on his FOSS, right?) and he tells them his DBMS provider says there is a bug in the networking system. Yes, and the issue will be resolved quickly as a million highly motivated people devote their free time to solving the problem. Somehow I don't think Cedric is going to get a resolution to this issue anytime soon.
I'm sorry Martin, I really don't expect you to be resolving Linux issues, but I do see a great deal of irony in all of this.
> From: Martin Phillips
> Although use of kill -9 is not a good idea in most
> situations, it should be safe to do when QM is not
> responding to other termination requests.
> We do need to understand this problem more fully and
> come up with a solution though all the evidence we
> have so far suggests that the hang is deep inside the
> Linux networking system and hence outside of our
> control.
I sometimes get this sort of finger pointing. It often happens on Windows systems, and with another MV database that I use. Nice to see the same thing happening with Linux. Don't want the FOSS people missing out! ;-)
Anyway, I've found an effective way to stop the finger pointing is to ask the person doing the pointing for EVIDENCE that the bug is where they say it is.
So, Martin. Do you have proof that the bug is in the Linux networking code?
> I'm sorry Martin, I really don't expect you to be resolving
> Linux issues, but I do see a great deal of irony in all of this.
I agree that it is not our job but, in this particular instance, one of our dealers has identified and fixed a problem that sounds like it could be the same issue. I have asked him to communicate directly with Cedric (or perhaps via this list) and he has agreed to do so as soon as time permits.
Re Ashley's comment...
> So, Martin. Do you have proof that the bug is in the Linux
> networking code?
We have seen two network connection problems that appear to be in Linux. The one that fits closest to Cedric's problem is where we hang inside a kernel function (as shown by strace) and never return to QM. This makes it difficult for us to catch the error.
The other one involves poll() or select() saying "yes, there is data waiting to be read" and read() saying "no, there isn't", resulting in a loop trying to recover the non-existent data. We have worked around this one inside QM.
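To make the second problem concrete, here is a small Python sketch of a defensive read loop. It is a generic pattern and not the workaround actually used inside QM, which is not described in this thread. The function name `read_when_ready`, the 4096-byte buffer and the retry limit are assumptions for this example. The idea is to treat a "readable" report followed by an empty-handed read as spurious, and to give up after a bounded number of attempts instead of looping forever.

```python
import os
import select


def read_when_ready(fd, timeout_ms=1000, max_spurious=3):
    """Wait for fd to become readable and return the bytes read.

    Returns b"" if the peer has closed the connection, and None if nothing
    arrived within the timeout or readiness kept proving spurious.
    Assumes fd has been put into non-blocking mode by the caller.
    """
    poller = select.poll()
    poller.register(fd, select.POLLIN)
    spurious = 0
    while spurious < max_spurious:
        if not poller.poll(timeout_ms):
            return None                 # timed out, nothing to read
        try:
            data = os.read(fd, 4096)
        except BlockingIOError:
            # poll() said "data waiting" but read() found none: count it
            # and try again, but only a limited number of times.
            spurious += 1
            continue
        except ConnectionResetError:
            return b""                  # treat a reset like a closed link
        return data                     # b"" here means end of file
    return None
```

A caller that gets b"" back can log the user out cleanly, which is the case that matters for a broken network link.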
> I agree that it is not our job but, in this particular instance, one of our
> dealers has identified and fixed a problem that sounds like it could be the
> same issue. I have asked him to communicate directly with Cedric (or perhaps
> via this list) and he has agreed to do so as soon as time permits.
I haven't received any direct support so far. I must admit that we have
stopped our migration to QM on those servers for now, as
this issue is a show stopper. We haven't had any new hangs since last
week, but I'm not sure that a kill will help, because that's pretty much
what I've been doing.
Our experience with D3 is that this can also happen there, but a
logoff will just bring the line back, whereas in QM it breaks the
whole QM server. Is it possible at least to fix the LOGOUT problem in
this case?
Just a thought...
If there's a suspected problem in the Linux internals, then presumably this problem does not exist for QM on the Windows or BSD platforms. If that's the case, perhaps you could consider using QM on top of FreeBSD, unless you are tightly tied to Gentoo.
We need to investigate this more fully. Please let us have full
details of how your connections are set up (direct into QM, via Linux
shell, ssh, etc) and the kernel revision in use.
A core dump of the process when it is stuck would be very helpful.
Failing this, please run strace to record the state of the process and
let us have the output.
It occurs to me that one of the best ways to get people to recognize FOSS OpenQM is to present a reproducible case to the Linux distro developers with OpenQM as the focal point. To fix the Linux problem they might need to install OpenQM, and in doing so they may want to know more about what it is. I hope it plays out like this.
In other words, Martin, it may be better to be less eager to fix this on your own, even if you can.
----- Original Message -----
From: "Martin Phillips" <MartinPhill...@ladybridge.com>
To: "OpenQM" <openqm@googlegroups.com>
Sent: Saturday, October 10, 2009 9:37 AM
Subject: Re: Network link broken but user still connected
> Hi Cedric,
> We need to investigate this more fully. Please let us have full
> details of how your connections are set up (direct into QM, via Linux
> shell, ssh, etc) and the kernel revision in use.
> A core dump of the process when it is stuck would be very helpful.
> Failing this, please run strace to record the state of the process and
> let us have the output.
On 10 Oct, 04:37, Martin Phillips <MartinPhill...@ladybridge.com> wrote:
> Hi Cedric,
> We need to investigate this more fully. Please let us have full
> details of how your connections are set up (direct into QM, via Linux
> shell, ssh, etc) and the kernel revision in use.
Users connect via ssh and are then redirected to QM by a
bash_profile that executes "/usr/qmsys/bin/qm -12 -AACCOUNT", for example.
Linux 2.6.28.4-xxxx-std-ipv4-32 #2 SMP Wed Feb 18 16:34:04 UTC 2009
i686 AMD Athlon(tm) X2 Dual Core Processor BE-2300 AuthenticAMD GNU/Linux
> A core dump of the process when it is stuck would be very helpful.
> Failing this, please run strace to record the state of the process and
> let us have the output.
What are the command lines for a core dump or strace?
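For reference, a hedged sketch of how those two command lines are typically built, written as small Python helpers so the flags can be commented. The helper names and the output paths are assumptions for illustration. strace is attached to a running process with -p, and gcore (shipped with gdb) writes a core file for a live process. Both normally need to be run as root or as the owner of the stuck process, and both tools have to be installed on the server.

```python
import shlex


def strace_command(pid, out_file):
    """Build an strace command that attaches to an already running process.

    -p PID   attach to the given process
    -f       also follow any child processes
    -tt      prefix each line with a timestamp
    -o FILE  write the trace to FILE instead of stderr
    """
    return ["strace", "-f", "-tt", "-p", str(pid), "-o", out_file]


def gcore_command(pid, prefix):
    """Build a gcore command; the dump is written to a file named PREFIX.PID."""
    return ["gcore", "-o", prefix, str(pid)]


def command_line(cmd):
    """Render an argument list as a single shell-safe command line."""
    return " ".join(shlex.quote(arg) for arg in cmd)


if __name__ == "__main__":
    stuck_pid = 1234  # hypothetical PID of the hung qm process, found with ps
    print(command_line(strace_command(stuck_pid, "/tmp/qm.strace")))
    print(command_line(gcore_command(stuck_pid, "/tmp/qm.core")))
```

strace keeps running until it is interrupted with Ctrl-C, so for a process that is hung inside a single call the trace file will usually contain just the one unfinished call it is stuck in.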
Sorry for the late answer. We haven't had any problems since then, but
we haven't changed any settings either.