Strange ack vs syn,ack



So got pulled into this weird problem. Can post more details in a bit. Waiting for full sniffs.

Having an issue where we see a syn in wireshark on the server x.x.110.210 from the client x.x.117.173, but we never see a syn,ack sent back, only some weird ack that I have never seen before.

[attached screenshot of the wireshark capture]

So this is on an MQ server. They have a test script for these mq2538 errors they are getting that generates 1000 connections. They see an mq2538 a few times out of the 1000; most of the connections work just fine.

Now I have seen cases where no syn,ack is sent back - application not even listening, firewall, etc. But I have not seen this before, where the server sees the syn from the client but instead of sending a syn,ack it sure looks like it's sending some garbage ack back. Look at the large ack number.

So you see the client retransmit the syn, and then the server sending even more garbage acks.

Anyone ever see anything like this? Keep in mind this only happens a few times out of the 1000; the other connections all work - you see syn,ack then ack and the conversation happens.

I would guess it is something on the server side, but they say when they run the test from a different client everything works fine with no errors. And I can not find anything different between the syns that get a syn,ack and the ones that don't. Waiting to see the full sniff from the test box that doesn't get any errors to the same mq server.
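Once the full sniffs land, something rough like this could pull out the conversations where a syn never got a syn,ack - just a quick sketch assuming pyshark (the python wrapper around tshark) is installed; the capture file name is a placeholder.

# Rough sketch: flag TCP streams in a capture where a SYN was seen
# but no SYN,ACK ever came back. Assumes pyshark; 'capture.pcap' is a placeholder.
import pyshark

SYN, ACK = 0x02, 0x10

streams_with_syn = set()
streams_with_synack = set()

cap = pyshark.FileCapture('capture.pcap', display_filter='tcp')
for pkt in cap:
    flags = int(pkt.tcp.flags, 16)     # e.g. '0x00000012' -> 0x12
    stream = int(pkt.tcp.stream)       # Wireshark's per-conversation index
    if flags & SYN and not flags & ACK:
        streams_with_syn.add(stream)
    if flags & SYN and flags & ACK:
        streams_with_synack.add(stream)
cap.close()

# Streams with a SYN but no SYN,ACK are the suspect conversations.
for s in sorted(streams_with_syn - streams_with_synack):
    print(f'tcp.stream == {s}  (SYN seen, no SYN,ACK)')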


Wow you never usually post :-p

 

So things like this generally tend to be poorly coded apps on those harry wong (cheap Chinese) devices - set top boxes, door entry systems, consumer stuff that is making its way onto the network, that kind of thing.

 

Although you did say it's an MQ server, I just read that....


Yeah, this is an IBM MQ server; not sure of the exact version but I believe it is 6.0.2.5.5.

 

So IBM is pointing fingers at MS, and MS says it's the app or that it's because it is on a VM, which is just nonsense. The MS guy wanted to follow up on why the machine wasn't seeing the RST that the client box sends in response to that weird ack, which is also nonsense - it has nothing to do with the issue, which is that there is no syn,ack.

 

I normally don't post because I don't normally run into issues I have not seen or can not troubleshoot ;) On this one I don't have access to the servers or the client, and I don't really understand the process flow of the system and applications - I have not worked with this java and mq stuff for years and years, and even then it was minor stuff.

 

Stuck on this one - it's not my server and not my client that I can touch and troubleshoot, so I'm having to ask for info from a 3rd party that has access. I didn't set up the servers, and it seems to me they were just installed with no tweaking or anything to run this system better. They didn't run into this problem until they updated the client side to the new 8.5 WebSphere application server. Not sure why the mq box was not upgraded to a more current version, and not sure why it is running on a 2003 SP1 box.

 

So the info we have so far is that IBM says MQ is never seeing the syn, so it's the OS sending back that weird ack?? Not sure I buy that - some systems might send back ack,rst, others should just drop the syn if the backlog was full. There shouldn't be this oddball ack. The system does not have dynamic backlog enabled, I had that checked. It's like going back in time trying to figure out how a 2k3 box handles this stuff ;) I have never seen the OS send back just an ack to a syn, and if it were just a flag not being set it shouldn't have that large an ack number. Now they do create a lot of connections, but if the OS ran out of memory or the backlog filled it would just drop the syn and you would see nothing back at all. And you would think it would happen on sequential syns, which it's not.
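Just to sanity check that backlog point, a quick experiment sketch (behavior is OS-dependent, so take it for what it is): a listener that never accepts plus a burst of connects - the overflow syns normally just get dropped (the client times out) or at worst get a rst back, never a bare ack. Address and port are placeholders.

# Sketch: what a full listen backlog looks like from the client's point of view.
import socket

HOST, PORT = '127.0.0.1', 50007          # placeholder address/port

srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind((HOST, PORT))
srv.listen(1)                            # tiny backlog, and we never call accept()

clients = []                             # keep references so the sockets stay open
for i in range(10):
    c = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    c.settimeout(2)
    try:
        c.connect((HOST, PORT))
        print(f'connect {i}: handshake completed (queued in the backlog)')
    except socket.timeout:
        print(f'connect {i}: timed out - syn most likely dropped, nothing came back')
    except ConnectionRefusedError:
        print(f'connect {i}: actively refused (rst) - stack-dependent behavior')
    clients.append(c)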

 

Only see it from 3 prod clients, and they take much longer to send that amount of traffic because they are doing prod work at the same time, while if they run it from a test box the 1000 tests are sent in less than 1 minute. You would think that would generate a possible syn attack response - but that is not what I see in the sniffs.

 

So all 3 vendors are supposed to be on a call on Monday - let's see if we get to the bottom of this. In my current role I don't deal with the server side unless I'm called into a problem - I'm more of just the network guy on this. Could have been in and out of the problem in a few minutes ;) Hey, see syn and no syn,ack - not a network issue. Good luck ;) heheheh


Your guess would be as good as mine, budman; I'm sure there's nothing for me to say that would be helpful to you.

 

Actually, if I had to guess - and I have to point out that I very rarely do any packet sniffing, so I may very well be wrong - I'd guess that for those few connections that exhibit the problem, the port selected for the new connection is one that was already previously used to establish a connection to this same server ip/port, and the connection has remained open. So while the client, with its new socket, tries to start a brand new transmission stream to the server starting with sequence_num=0, as far as the server is concerned the previous connection/stream from that host ip/port is still open, and the sequence number it's now receiving is wrong (out of sequence). The server, rather than sending back a SYN,ACK, sends back an ACK to inform the client of the sequence number it was actually expecting, allowing the client to recognise that a problem has occurred (either a bug in the client code, or corruption of the packet in transmission) and giving it a chance to correct it. The client is certain its sequence number is correct, so it simply retransmits with the same sequence number, leaving the two of them in a stalemate.

 

Note that in network programming (which I also admit I do not have a lot of experience with currently), it is possible for a program to bind a new socket to a port already in use on an open connection. I personally learned about this a while back with a simple python client/server script I wrote to get myself more familiar with network programming (which I put on hold and must get back to sometime soon); with my server program, sometimes (depending on what I was doing) I would have to kill the existing running instance via task manager, which left the connection open. Upon re-starting the server (perhaps with new code this time), I had to have it set the SO_REUSEADDR flag on the socket before attempting to bind to the port, otherwise it would give me an error due to the existing open connection left behind by the previous instance I killed. There is also another alternative flag called SO_REUSEPORT. You can read some useful info about these here: http://stackoverflow.com/questions/14388706/socket-options-so-reuseaddr-and-so-reuseport-how-do-they-differ-do-they-mean-t
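A minimal sketch of that rebind scenario (not my original script, just an illustration - the port number is a placeholder):

# Without SO_REUSEADDR, re-binding a listener to a port that still has a lingering
# connection (e.g. left behind after killing the old instance) fails with
# "Address already in use"; set the flag before bind() and it succeeds.
import socket

PORT = 50007   # placeholder

srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)  # must be set before bind()
srv.bind(('', PORT))
srv.listen(5)
print('listening on', PORT)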

 

In the little example program I just mentioned, I did not have to use one of these flags on the client, since a random unused port is picked by the OS; however, I'm not aware of anything that says you can't. Note that, as explained in the above link, these flags can allow multiple sockets to be opened (by the same program or by several) against the same port at the same time. I imagine that in such a situation the OS/network stack would just merge all of the data streams into one, continuing the existing sequence numbering, otherwise it could never work, because the sequence numbers on all of the other merged streams would be rejected. So I wouldn't imagine it's a case of multiple sockets being active on one port at a time. I would think it more likely that on the client, a new socket is being created and assigned a port for which an open connection does exist (at least as far as the server is concerned), but to which no client-side socket is currently attached (with the SO_REUSEADDR or SO_REUSEPORT flag being used to allow attaching to it if the client side also considers the connection to be open, not just the server), and because no other socket was attached to this port, sequence numbering starts afresh at zero. However, the socket is then connected to exactly the same destination (server) ip and port that the socket's local port was previously connected to, and as far as the server is concerned, not only is the connection from that source (client) ip/port still open, but something on the server (the MQ server software) still has a live listening socket attached, so it tries to resume the previous sequence numbering, which clashes with what the client is now trying to send.
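And a sketch of the client-side half of that theory (illustration only, not the real MQ client): normally the OS picks an unused ephemeral source port on connect(), but a client can pin one explicitly, which is how you could deliberately reproduce the collision described above. The server address here is a placeholder; 1414 is just MQ's usual listener port.

# Pin the client's local source port instead of letting the OS choose one.
import socket

srv_addr = ('192.0.2.10', 1414)   # placeholder MQ listener address

c = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
c.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
c.bind(('', 40000))               # reuse a specific source port on purpose
c.connect(srv_addr)               # if the server still holds this ip/port "open", the SYN collides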

 

Perhaps the server OS or MQ software is hanging onto improperly closed connections for too long? Perhaps there's a bug in the client software that's not closing these connections properly when it's done with them?


To be fair, this is way too advanced for Neowin. You need the likes of the Cisco forums or the DeveloperWorks forums. The turnover of users may not be as high, but you might get a more directed answer.


^ yeah, valid point ;) I also believed it was a bit too advanced for neowin. But I figured it couldn't hurt - so now I can no longer say I have never asked anything on neowin - a shot in the dark to be honest ;)

 

And I might have hit some gold here with this statement:

 

"the port selected for the new connection is one that was already previously used to establish a connection to this same server ip/port, and the connection has remained open."

 

While that normally would seem unlikely, it did get me thinking. There are lots and lots of connections being made from these prod boxes, I mean a lot!! They run this batch job where you get 30K connections, and they run into an issue with something like 10% failures on the connections. Hmmm, might need to take a look at netstat on the MQ box to see what is open before they run the test script, to see if any match. Need to turn off the relative seq/ack numbers in wireshark, since that syn seq number in reality is really going to be something between 0 and 4,294,967,295.

 

So while you normally see really low ack numbers in wireshark, relative numbering is on by default. Going to have to take a look at this, if only to rule it out. A netstat on the MQ box should show all connections in place from the source IPs in question, and running the test script again and looking at the source ports of the failures should tell us if the ones that fail are reusing source ports.
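Something rough like this could do that cross-check (just a sketch, nothing official): feed it the netstat -an dump taken on the MQ box before the run plus the failing source ports pulled from the sniff, and it reports which failures line up with connections the MQ box already held open. The file name, client IP, MQ port and example ports are placeholders.

# Cross-reference a pre-test netstat dump with the source ports of failed handshakes.
import re

CLIENT_IP = 'x.x.117.173'   # placeholder client address
MQ_PORT = '1414'            # placeholder MQ listener port

established = set()
with open('netstat_before.txt') as f:
    for line in f:
        # Windows netstat format:  TCP  local_ip:port  remote_ip:port  ESTABLISHED
        m = re.match(r'\s*TCP\s+\S+:' + MQ_PORT + r'\s+' + re.escape(CLIENT_IP) + r':(\d+)\s+ESTABLISHED', line)
        if m:
            established.add(int(m.group(1)))

failing_ports = [40123, 40987]   # source ports of the failed handshakes (example values)

for port in failing_ports:
    if port in established:
        print(f'source port {port}: was ALREADY established on the MQ box before the run')
    else:
        print(f'source port {port}: was not in the pre-test netstat')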

 

Damn it -- didn't want to look at this on the weekend.. But now you got me curious ;)


Hello,

I seem to recall reading about something similar a few years ago, where it turned out to be an issue with the PHY on the NIC itself. The solution at the time was to replace the NIC with one using a chipset from a different vendor: http://thehackernews.com/2013/02/flaw-in-intel-ethernet-controller.html

 

The symptoms are not exactly the same as reported in this article, but it might be worth swapping in a NIC with a different silicon vendor just as part of the troubleshooting process.

 

Regards,

 

Aryeh Goretsky


Well, this looks like it's going to go unanswered - they rebooted the MQ box that had been up for 282 days, and now they can not recreate the errors.. Got to love it ;)



 

a reboot solved the problem? that's the first rule in IT budman, you should know that!  :laugh:  :rofl:


hehhe -- dude, I told them to reboot as soon as they pulled me into the call ;) Even if they didn't want to reboot the whole server, at least restart the MQ service, etc. Now they are trying to generate errors with 5000 queries.. Let's see what happens.


So it came back later in the day. Still working this - but we have some real progress that makes sense now with more info.

So for whatever reason it seems connections are being left open. Via a netstat taken just before the test script run (which now does 5000 queries), the source ports of the 2 failures seen in the run already showed up as established connections on the MQ side. The client sent a RST (seen in the client sniff), but the server did not reset the connection; the connections were still listed as open even 1 hour after the test run.

Seems that connections are getting hung open, and then when the client side runs through its ephemeral ports (default is 16K-ish) and tries to use a port that the MQ box never closed from a previous connection, you see the failure. And it's clearly not paying attention to the RSTs, because multiple are being sent and are seen on the MQ side from the client.
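Quick back-of-the-envelope on that (my numbers, assuming the usual ~16K dynamic port range and the 30K-connection batch mentioned earlier):

# Rough estimate of how often source ports get reused in one batch run.
ephemeral_ports = 65535 - 49152 + 1    # ~16K dynamic ports (assumed default range)
connections_in_batch = 30000           # figure quoted earlier in the thread

full_cycles = connections_in_batch / ephemeral_ports
print(f'{ephemeral_ports} dynamic ports, {connections_in_batch} connections '
      f'-> each source port reused about {full_cycles:.1f} times per batch')
# Any reused port that the MQ box still holds ESTABLISHED from an earlier pass collides.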



 

Glad I could help with pinpointing the problem. I wonder why it's ignoring those resets?

 

Do I get a special "helped the resident network guru with a technical networking issue" badge now? :D


Bug in MQ server / misconfiguration / or bug in application code perhaps?

 

Googling for "mq server not closing connections" turns up results like the following:

 

https://stackoverflow.com/questions/3061513/when-disconnecting-from-websphere-mq-with-c-sharp-client-tcp-connections-are-sti

https://www.ibm.com/developerworks/community/forums/html/topic?id=77777777-0000-0000-0000-000013892458


Not my problem - not my servers, and it's not a network issue per se, it's an OS issue ;) While it is related to the network stack of the OS or the application, it seems clear that the problem is somewhere on the host and not on the wire. Clearly it saw a RST; not closing that connection is not the network, it's the host - be it the OS or the application - not following the slam-the-door-on-that-connection instruction ;)

 

To be honest, none of my network gear that I work on is even involved, I just got pulled in to help -- but yes it's an odd one, and yes it's getting frustrating not having direct access to the servers and having to walk people through how to use commands as easy as netstat or tcpview, etc. Then to get the sniffs I have to pull them down through RDP - which is like watching paint dry ;)

 

So a bit more info - even with lots of RSTs sent and seen, the connections would not clear. But using tcpview we closed one of the connections. At first it did not close - we were watching it and watching it - but at some point maybe 30 min later it closed, and all the other hung ports that we had identified from the day before also closed, even though they had been open for way more than 24 hours.

 

Current plan is to update it to sp2, and then see if we can get a conversation to hang.

 

Odd thing is, after the hung ports cleared we tried to duplicate the problem again. Took a netstat to see what ports were open, ran through the test script and got 4 errors. Looking in the sniff, the connections from the source ports that did not work were in fact already established in the pre-test netstat. But when we went to verify that they were still open like before, they were cleared. We did not see any RST in the sniff on the mq box - so why did they close? Did not get a sniff on the client this time to see if a RST was sent but not seen by the MQ server.

 

MS has been of little help. The OS is EOL, and even when moved to sp2 it will be EOL next year. My suggestion is to move to a current OS and a current version of MQ - why they are still using such an old version of both the OS and the software when the websphere side is current makes no sense to me.

edit: Yeah, we are getting somewhere; in mq explorer we are seeing channels open a long time.. One from the 14th, with a heartbeat of 300 seconds set - this should have closed 5 minutes after not receiving anything, which it has not since the 14th.. So why hasn't it closed - ibm is looking into it.. Finally can see the end of the tunnel on this issue ;)
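Side note - if MQ Explorer over RDP is painful, channel status can also be pulled from the command line with runmqsc; a rough sketch below (the queue manager name is a placeholder, and the attribute filter assumes the usual MQSC DISPLAY CHSTATUS output):

# Sketch: dump channel status so long-running channels stand out.
import subprocess

mqsc = 'DISPLAY CHSTATUS(*) ALL\n'
result = subprocess.run(['runmqsc', 'QM1'], input=mqsc, capture_output=True, text=True)

for line in result.stdout.splitlines():
    # crude filter: channel name plus last-message date/time attributes
    if 'CHANNEL(' in line or 'LSTMSGDA' in line or 'LSTMSGTI' in line:
        print(line.strip())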


I saw this thread...then I saw it was started by BudMan.

I have started to buy emergency supplies. The apocalypse is coming.

In all seriousness, even though we can't help you BudMan, I think we can spread your problem to different areas and see if we can find someone that has come across this problem.


Well, I believe we have a solution - finally!! I've been asking IBM for days now how to close these connections that are staying open. Seems this is the solution:

http://www-01.ibm.com/support/docview.wss?rs=171&context=SSFKSJ&dc=DB560&dc=DB520&uid=swg21376219&loc=en_US&cs=UTF-8&lang=en&rss=ct171websphere

We didn't see an issue with increasing the number.. But this should close the connections that get out of sync - where the client does not have the connection any more but the MQ box does.

