Server 6.4: positions not recorded (device freeze issue)

Victor Butler, 2 months ago

Hello,

I have been running Traccar server version 6.4 and I am experiencing the device freeze issue yet again: the incoming HEX is recorded in the log and an event might be generated, but no data is saved in the positions table.

The issue appears to occur on random devices.
No errors in the logs.

After server restart, positions start to be updated correctly. However, the data during the "freeze" period is missing.

Here's an example from the log file - top 3 records are for a frozen device, right after that we've got a working device:

//The device with the problem:
2024-09-09 06:33:06  INFO: [Uff2bd2c6: teltonika < XXX.XXX.XXX.XXX] 0079cafe015e000f3132333435363738393132333435368e0100000191d57b3c08000267bae31e5be78d000000000000000000000f000700ef0000f00000500400150500c80300450300716400050042325e00430fe20011ffc20012fffd0013ffd700030010092a7b38000c002a2a5001c1000909000000000001
2024-09-09 06:33:06  INFO: [Uff2bd2c6: teltonika > XXX.XXX.XXX.XXX] 00050000015e01
2024-09-09 06:33:06  INFO: Event id: 123456789123456, time: 2024-09-09 06:33:06, type: deviceOnline, notifications: 0
//Next record for a different device works
2024-09-09 06:33:07  INFO: [Uff2bd2c6: teltonika < XXX.XXX.XXX.XXX] 00cfcafe01e7000f3938373635343332313938373635348e0100000191d57c516000fe285a771eb82b040036002d1100410000001f000a00ef0100f00100500100150400c80000450100ed01007164019a00019b3d000c00b5000900b60006004236230018004100430fdf000d00040011ff700012000b00130007000f0000047f006404800011000700f100005b7200c700000000001002433940000c000c586a01c1001e4b5b01850000944801860000000000000002010000114b4e414343383147554c353034363133330119000001
2024-09-09 06:33:07  INFO: [Uff2bd2c6: teltonika > XXX.XXX.XXX.XXX] 0005000001e701
2024-09-09 06:33:07  INFO: [Uff2bd2c6] id: 987654321987654, time: 2024-09-09 06:33:00, lat: 51.53861, lon: -3.09098, speed: 35.1, course: 45.0

What can I do to troubleshoot?
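
The frame headers in the log excerpt above can be decoded by hand to confirm which device sent each packet and whether the acknowledgement matches it. The following is a minimal Python sketch, assuming the Teltonika UDP layout as commonly documented (2-byte length, 2-byte packet ID, 1 unused byte, 1-byte AVL packet ID, 2-byte IMEI length, the IMEI in ASCII, then the codec ID and record count). The function names are hypothetical and this is not Traccar code.

```python
import struct


def parse_udp_header(hex_frame: str) -> dict:
    """Decode the leading fields of an incoming Teltonika UDP frame (assumed layout)."""
    raw = bytes.fromhex(hex_frame)
    # Big-endian: length, packet ID, unused byte, AVL packet ID, IMEI length
    length, packet_id, _unused, avl_packet_id, imei_len = struct.unpack_from(">HHBBH", raw, 0)
    imei_end = 8 + imei_len
    return {
        "length": length,
        "packet_id": packet_id,
        "avl_packet_id": avl_packet_id,
        "imei": raw[8:imei_end].decode("ascii"),
        "codec_id": raw[imei_end],
        "record_count": raw[imei_end + 1],
    }


def parse_udp_ack(hex_frame: str) -> dict:
    """Decode the server acknowledgement (assumed layout, 7 bytes)."""
    raw = bytes.fromhex(hex_frame)
    # Big-endian: length, packet ID, unused byte, AVL packet ID, accepted record count
    length, packet_id, _unused, avl_packet_id, accepted = struct.unpack_from(">HHBBB", raw, 0)
    return {
        "length": length,
        "packet_id": packet_id,
        "avl_packet_id": avl_packet_id,
        "accepted": accepted,
    }


# First 25 bytes of the frame from the frozen device, and the reply logged for it
frame = "0079cafe015e000f3132333435363738393132333435368e01"
ack = "00050000015e01"

header = parse_udp_header(frame)
reply = parse_udp_ack(ack)
print(header["imei"], header["record_count"])  # 123456789123456 1
print(reply["avl_packet_id"] == header["avl_packet_id"], reply["accepted"])  # True 1
```

Under these assumptions, the frame from the frozen device decodes cleanly and the reply reports one accepted record, which is consistent with the frame being received and acknowledged even though no position line follows it in the log.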

Kaloyan Kanev, 2 months ago

This does not look like valid TCP data. Check the device configuration, change it from UDP to TCP, and try again.

Victor Butler, 2 months ago

That is correct, I am using the UDP protocol with instant acknowledgement.

UDP was never a problem before (versions 5.6 and older). Moreover, as you can see, the HEX is received and an event is recorded, so a "lost" UDP packet cannot explain the issue at hand.

Anton Tananaev, 2 months ago

Maybe try the TCP protocol.

Victor Butler, 2 months ago

If it was a protocol issue, why is only the event registered and not the positions data? It doesn't make any sense...

Anton Tananaev, 2 months ago

There are pretty significant differences between the way TCP and UDP are handled, so there can easily be a problem there.

Victor Butler, 2 months ago

Is there a single place in any of the handlers where I can add logging that will capture every position that is saved (and any errors that occur along the way)?

Anton Tananaev, 2 months ago

If there was, don't you think we would have had logging there already?

Victor Butler, 2 months ago

I am not in your head, I don't know what you would or would not do.

All I know is that there is an issue recording the positions data. It might be UDP protocol related but it's still an issue and switching to TCP would not resolve it.

I am willing to help resolve it by logging the positions handler output, but I need some advice on where to start.

I would do it myself, but it would take days. With your knowledge it would take only a few minutes.

Anton Tananaev, 2 months ago

I'm giving a suggestion on what to try, but you're not interested, so I don't think we're on the same page here.

Victor Butler, 2 months ago

No, we are not. Your suggestion is a workaround, not a fix for the issue, and I am interested in fixing the problem.

Also, I am starting to think you know more about the issue than you are willing to share. Is there a known issue with UDP that cannot be fixed for the moment?

Anton Tananaev, 2 months ago

My suggestion is not a workaround. It's a way to narrow down the root cause of the problem.

Victor Butler, 2 months ago

I'd like to do this too, but the issue occurs on a random set of devices at random times, so I never know which devices will freeze. I cannot switch all devices to TCP only to test a theory.

Adding a log statement at the right place(s) in the source code should help pinpoint the issue much faster and more efficiently.

Since you know the code well, I was hoping you could advise on how to debug this quickly.

Victor Butler, 2 months ago

Anton, another user just confirmed that he still experiences device freezes on version 6.4 over TCP. See here.

I've also done some log analysis after running the server for about 24 hours since yesterday's restart.

360 devices have sent data to the server.
100% of the records are acknowledged, i.e. the server sent a response such as 00050000018101 back.
20 devices have missing records.
15 devices have just one missing record. I still need to confirm whether it is the last record that is missing. If it is the last one, it might mean that the device has just frozen.
5 devices are confirmed frozen, with between 50 and 300 missing records each. No new records are saved in the positions table for them.
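
A per-device count like the one above can be reproduced from the log file with a short script. This is a minimal Python sketch, assuming the log line formats shown in the excerpt at the top of this thread, the same Teltonika UDP header layout (IMEI length at bytes 6 to 7, IMEI in ASCII, then codec ID and record count), and that the `id:` in a position line equals the device IMEI as it does in that excerpt. Because both devices in the excerpt share the same `Uff2bd2c6` identifier, the sketch takes the device identity from the frame itself rather than from the bracketed prefix. The function names are hypothetical.

```python
import re
from collections import Counter

# Incoming frame: "[Uff2bd2c6: teltonika < address] <hex>"
INCOMING = re.compile(r"\[\w+: teltonika < [^\]]+\] ([0-9a-fA-F]+)")
# Decoded position: "[Uff2bd2c6] id: <device id>, time: ..."
POSITION = re.compile(r"\[\w+\] id: (\w+), time: ")


def imei_and_record_count(hex_frame):
    """Return (IMEI, record count) from a UDP frame, or None if it cannot be parsed."""
    try:
        raw = bytes.fromhex(hex_frame)
        imei_len = int.from_bytes(raw[6:8], "big")
        imei = raw[8:8 + imei_len].decode("ascii")
        return imei, raw[9 + imei_len]
    except (ValueError, IndexError):
        return None


def missing_records(lines):
    """Map each IMEI to (records received - positions logged), where that is positive."""
    received = Counter()
    logged = Counter()
    for line in lines:
        match = INCOMING.search(line)
        if match:
            parsed = imei_and_record_count(match.group(1))
            if parsed:
                received[parsed[0]] += parsed[1]
            continue
        match = POSITION.search(line)
        if match:
            logged[match.group(1)] += 1
    return {
        imei: received[imei] - logged[imei]
        for imei in received
        if received[imei] > logged[imei]
    }


if __name__ == "__main__":
    import sys

    # Usage: python missing_records.py /path/to/tracker-server.log
    if len(sys.argv) > 1:
        with open(sys.argv[1], encoding="utf-8", errors="replace") as log:
            for imei, count in sorted(missing_records(log).items()):
                print(imei, count)
```

The count is only an approximation: positions that the server drops on purpose (for example through filtering settings) would also show up as missing, so devices flagged here still need to be checked against the positions table.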

I will start looking at the code in a few moments and would appreciate your feedback on where to start.

Anton Tananaev, 2 months ago

Just add more logging in some pipeline handlers and see what happens to decoded positions.