Tuesday, June 16, 2009

The mysterious ORA-03111 error

Recently one of the applications that I developed started throwing exceptions, that had the following message:
SQL state [72000]; error code [1013]; ORA-03111: break received on communication channel
When I googled around, I couldn't come across anything useful. Sadly enough most of the sites just showed the documentation for that error, without any explanation from anyone experiencing that issues. So here you go, with the best possible explanation that I could come up with.

My application sets two things on the connection that is throwing this exception:
  • It sets the fetchSize to be 2500 rows
  • It sets the query timeout to be 10 seconds
The database server and the application are separated over a long latency network (actually there is a NetEm box that emulates the long latency between these two boxes) which has a latency characteristic of 50+/-5 milliseconds. This is the whole setup.

It is important to understand how the timeout is handled by the Oracle client (in my case JDBC client). Once the query is successfully submitted, the client starts a clock for the timeout. Once the timeout is reached, the client sends an URG message to the Oracle server. The moment Oracle server receives this message, it knows that the client wants to cancel the operation that it was carrying on, no matter what stage the operation is in.

So take a couple of cases. Assume the operation is a SELECT query that will result in 10000 rows. If the Oracle server hasn't even started fetching the results, most likely the client's request would be responded immediately with an error code ORA-01013, which has a description like:
SQL state [72000]; error code [1013]; ORA-01013: user requested cancel of current operation
But imagine the server has fetched the rows and in the process of pumping the resultset back to the client. If the client requests the Oracle server to cancel the operation while still there is pending data in the socket to be delivered, it just adds the ORA-03111 packet at the end of the pending packets and lets the client knows that the operation has been cancelled while there is pending data to be delivered.

Look at the tcpdump output below:
23:13:08.613007 IP jdbc_client.48681 > orcle_server.1521: P 2543:3174(631) ack 2342 win 11908
....
23:13:18.635068 IP jdbc_client.48681 > orcle_server.1521: P 3174:3175(1) ack 265693 win 65535 urg 1
....
23:13:20.472561 IP orcle_server.1521 > jdbc_client.48681: P 398520:398615(95) ack 3186 win 65535
0x0000: 0015 c5ec 12a8 0021 1c1d c0c3 0800 4500 .......!......E.
0x0010: 0087 5023 0000 3406 bad6 c0a8 fd9a c0a8 ..P#..4.........
0x0020: fc8b 05f1 be29 a792 459f a091 06cc 5018 .....)..E.....P.
0x0030: ffff 67c2 0000 005f 0000 0600 0000 0000 ..g...._........
0x0040: 0402 04e3 0203 f500 0001 0300 0300 0000 ................
0x0050: 0000 0000 0000 0000 0000 0001 0100 0000 ................
0x0060: 0033 4f52 412d 3033 3131 313a 2062 7265 .3ORA-03111:.bre
0x0070: 616b 2072 6563 6569 7665 6420 6f6e 2063 ak.received.on.c
0x0080: 6f6d 6d75 6e69 6361 7469 6f6e 2063 6861 ommunication.cha
0x0090: 6e6e 656c 0a nnel.

Pay special attention to the times when the SELECT query was sent (21:13:08) and when the cancel request as an URG packet was sent (21:13:18), and when the Oracle sends the last TNS packet that has the error code ORA-03111 (21:13:20).

The cancel request as an URG packet was sent after 10 seconds because as I mentioned earlier my query timeout is 10 seconds.

So now the million dollar question: What should I do if I am facing this issue in my application?

Follow these simple steps:
  • First make sure that your query can be completed within the timeout that you have specified. If you consistently face this exception, try increasing your timeout.
  • That might help to get rid of the exception, but not the root cause. The root cause usually is a database that is not optimized for the query that you are executing or a bad network.
  • To find out if its the database that is the issue, try executing the same query in a host closer to the network. Or try executing the same query hitting the database from a different network. If you are convinced the database is the issue, try to tune it.
  • To find if it is the network that is having the issue, try to do a tcpdump and analyze if there are any out of order deliver of packets. Or dropped packets. If yes, then try to fix the network.
In my case, it turned out to be the bad configuration in the NetEm that was causing too many packets to be delivered out of order and too many duplicated packets. Remember I was introducing a variance of 10 ms (i.e. my packets could be delayed anywhere from 45 ms to 55 ms, as per my configuration). In real cases, at least in a well maintained production network, the variance will not be more than 1 ms.

Since I am not an expert in Oracle, I would be happy if anyone reading this blog entry has something to add on top of what I have told here. And I sincerely believe that this posting would help whoever is facing this issue.

26 comments:

calmze said...

Thanks. Excellent post. I've been getting this error too.

veenu said...

Thanx...excellent discription

Nayan said...

This issue has been excellently described .
Thanks for that .

Anonymous said...

Thanks Roy. Very helpful.

Daniela said...

Thank you so much :)

jijeesh said...

Thanks Roy, It is helpfull for me..

Iam working in a unix platform which connects to database, sometimes iam getting the intimation of
the error ORA-01013 of cancelling a query, i know that it may be also a time out.. There are many jobs running in the server, but iam unable to find which one produce this timeout.

could you please let me know, where i could find out the details of the error stored, and also please tell me how to trigger that particular job

Faisal Khan said...

Thanks, its very helpful.

Anonymous said...

thanks! : )

Senthil Balakrishnan said...

Nice post.

Anonymous said...

Excellent post...

Anonymous said...

Check how long your it is taking to execute the procedure. It may be more than the timeout set on the statement.

Anonymous said...

indeed a good explanation!!!

Adust Jed said...

Thanks, was very heplful

mike said...
This comment has been removed by the author.
mike said...

Very Useful a post. Thax

Babu said...

It was nice..

Just want to add here.

In tomcat the connection timeout was 300 Sec [ By default ] & In JDBC the statement timeout was 15 sec [ In my case ]

Increasing the timeout we can avoid this message but we actually need to find the problem sql query to tune.

between once again thanks for your knowledge share.

Babu said...

It was nice..

Just want to add here.

In tomcat the connection timeout was 300 Sec [ By default ] & In JDBC the statement timeout was 15 sec [ In my case ]

Increasing the timeout we can avoid this message but we actually need to find the problem sql query to tune.

between once again thanks for your knowledge share.

harsh lakhani said...

excellent explanation , thanks Roy...

Anandan B said...

Excellant engineering perspective. Appreciated

dnsh said...

Hi all, we have weblogic where jta timeout is set as 300 secs.We have a issue wherein some of the queries are getting cancelled or timeout after 300 secs with ORA-03111 error.But for some cases its not so and the queries keep on running at the db and the weblogic becomes overloaded state and causing slowness to application.

FL said...

We had this same error when trying to use the managed-code Oracle .NET provider with an Oracle 9 database. Turns out it isn't supported. Maybe this helps someone. Thanks for the post!

Anonymous said...

I love this blog dude

Anonymous said...

Thanks.....excellent...

Anonymous said...

Good info on ORA-03111 and ORA-01013. Been looking for awhile and found this blog! Thanks1

Yahia Musse said...

Hello from 2016, this was really helpful. Thanks a lot!

Christian said...

Still relevant in 2017 ;)

Thank you!