Experiencing HTTP request timeouts on the osmand port (5055) since upgrading from 5.9 to 6.x

Jack3 months ago

I've read the whole logs, obviously. I've reported here the section grepped by connection id, as I saw no log lines without such a prefix between the connected and disconnected lines where the problem happens.

What I see between the connected and disconnected lines of a dropped HTTP request is a bunch of other HTTP requests from the very same client pushing data for different devices. Some of those requests are answered correctly, some are ignored (1 in 10 in this specific case) just like the grepped one. All log lines are at level INFO and no errors are raised.

If I search the entire log file for any line with a level other than INFO, I only find these WARN messages at server boot:

2024-08-17 00:03:50  WARN: No systemd support - Unable to load library 'systemd':
2024-08-17 00:03:53  WARN: JAXBContext implementation could not be found. WADL feature is disabled.

and a bunch of WARN: Unknown device messages, which I am aware of. My HTTP client expects an HTTP status != 200 when this happens; here I have logs proving that Traccar is ignoring the HTTP request.

Anton Tananaev3 months ago

Have you tried using server.instantAcknowledgement?

Jack3 months ago

No, but is there any procedure that could take longer than 30s from receiving the HTTP request to sending a response?
Are you suggesting that the huge delay is MySQL not writing in time? Should I troubleshoot MySQL instead of Traccar?

Using server.instantAcknowledgement would give me an ACK with no confidence that the data has been written to the DB. That doesn't seem like a solution to me, but rather sweeping the issue under the carpet.

Can you point me to the source code sections right after the HTTP request is received and where the HTTP response is generated? That way I can get an idea of where the system is silently dropping the request.

Tony Shelver3 months ago

Totally on a tangent, but I had something similar on a 5.10 release shortly after upgrading from 5.6 (I think it was those versions), and tried various settings to get around it.

It finally turned out to be a circular assignment in the groups hierarchy put in by one of my users, which would seem totally unrelated but may be worth checking.

Jack3 months ago

Thanks for the hint, but I have zero groups in my Traccar server.

I also quickly checked for hanging queries in MySQL via mysqladmin -u mysqlrootuser -p -i 1 processlist but found none.

Jack3 months ago

For the sake of clarity, here are my mysqladmin status results:

Uptime: 37407 Threads: 11 Questions: 222163 Slow queries: 0 Opens: 732 Flush tables: 3 Open tables: 652 Queries per second avg: 5.939

Jack3 months ago

I've written a Python parser for tracker-server.log to calculate the missing response ratio by (a sketch of the approach is below the list):

  • grouping lines by connection id
  • discriminating groups by the presence or absence of osmand < and/or osmand >
  • asserting that the two group sets don't overlap and that the total list of unique connection ids is preserved
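
A minimal sketch of that parsing approach (the exact log line layout and the regex are my assumptions, so adjust them to the real tracker-server.log format):

import re
from collections import defaultdict

# assumed log line shapes (adjust the regex to the actual format):
#   2024-08-15 00:01:02  INFO: [a1b2c3d4: osmand < 203.0.113.7] GET /?id=...
#   2024-08-15 00:01:02  INFO: [a1b2c3d4: osmand > 203.0.113.7] HTTP/1.1 200 OK
LINE_RE = re.compile(r"\[(?P<conn>\w+):\s*osmand\s*(?P<dir>[<>])")

def response_ratios(path):
    directions = defaultdict(set)                 # connection id -> {"<", ">"}
    with open(path) as log:
        for line in log:
            match = LINE_RE.search(line)
            if match:
                directions[match.group("conn")].add(match.group("dir"))

    total_connections = set(directions)
    in_only_connections = {c for c, d in directions.items() if d == {"<"}}
    in_out_connections = {c for c, d in directions.items() if d == {"<", ">"}}

    # the two groups must not overlap and together must cover every osmand connection id
    assert not (in_only_connections & in_out_connections)
    assert in_only_connections | in_out_connections == total_connections

    return (len(in_only_connections) / len(total_connections),
            len(in_out_connections) / len(total_connections))

print(response_ratios("tracker-server.log"))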

len(in_only_connections) / len(total_connections), len(in_out_connections) / len(total_connections):

  • tracker-server.log.20240814: (0.4959882787971813, 0.5040117212028187)
  • tracker-server.log.20240815: (0.048451174573929064, 0.9515488254260709)
  • tracker-server.log (partial): (0.14011943242206762, 0.8598805675779324)

The numbers say that Traccar receives/logs the osmand < HTTP request but doesn't respond/produce the osmand > line, with a probability varying from 5% to 50%.

Jack3 months ago

By parsing all the logged connections with an incoming request but no outgoing response, and searching for matching points (id, timestamp, location) via Traccar reports, it is evident that some of the points ARE present and some are not. I should query all those points via the API and/or an SQL query to get final stats.

I have to consider the lack of an osmand > line as either a false negative or a true negative. server.instantAcknowledgement might solve the false negative case, but the true negatives are problematic.

Jack3 months ago

After sampling 1000 random requests from the group with a request (osmand <) but no response (osmand >), and querying the /api/positions endpoint for the presence of each fix in the DB, here's the result:

assert len(sample_df) == len(false_negative_points) + len(true_negative_points)

false_negative_ratio = len(false_negative_points) / len(sample_df)
true_negative_ratio = len(true_negative_points) / len(sample_df)

false_negative_ratio, true_negative_ratio

(0.167, 0.833)

It means that the Traccar server is silently dropping a variable share, from roughly 4% to 40%, of all the incoming osmand data.
This is an alarming issue.
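
For reference, a minimal sketch of the presence check I'm describing, assuming the standard /api/positions parameters (deviceId, from, to); the server URL, credentials, the sampled fix and the fixTime-based matching are placeholders/assumptions:

from datetime import datetime, timedelta, timezone
import requests

BASE = "http://localhost:8082"      # placeholder Traccar server URL
AUTH = ("admin", "admin")           # placeholder credentials

def is_stored(device_id, fix_time, window_s=30):
    # ask the API for this device's positions in a small window around the fix time
    frm = (fix_time - timedelta(seconds=window_s)).strftime("%Y-%m-%dT%H:%M:%SZ")
    to = (fix_time + timedelta(seconds=window_s)).strftime("%Y-%m-%dT%H:%M:%SZ")
    resp = requests.get(f"{BASE}/api/positions",
                        params={"deviceId": device_id, "from": frm, "to": to},
                        auth=AUTH)
    resp.raise_for_status()
    # a fix counts as stored if any returned position falls on the same second;
    # in practice you may also want to match on coordinates
    return any(p["fixTime"].startswith(fix_time.strftime("%Y-%m-%dT%H:%M:%S"))
               for p in resp.json())

# classify one sampled fix as false negative (stored) or true negative (lost)
fix_time = datetime(2024, 8, 15, 10, 0, 0, tzinfo=timezone.utc)
print("false negative" if is_stored(42, fix_time) else "true negative")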

Anton Tananaev3 months ago

Using instant acknowledgement is not just a workaround. It should also tell us something about the problem and help find the root cause. But if you don't want to try it, that's your choice.

Jack3 months ago

I never said I don't want to try it; those are your words, not mine. I'm sorry for the misunderstanding.

This is all the documentation says about the instantAcknowledgement setting: Send device responses immediately before writing it in the database.

So I looked into the java source code https://github.com/traccar/traccar/blob/b1791569e71a39936c5d5d6e96996f55cb671ae4/src/main/java/org/traccar/BasePipelineFactory.java#L101

What I can grasp from that is that enabling the option would return 200 OK to the HTTP client without knowing whether the point has been ingested by Traccar or not, making it actually harder to assess the size of a problem of apparently random nature, so I asked questions like:

  • which Traccar internal procedure may take longer than 30s from receiving the HTTP request to sending a response
  • which component might be responsible for this, the Java application or the MySQL server
  • which two points in the Java source code, before and after the section where the drop happens, would help narrow down the possible causes

As you can see I'm diving into the stats of the problem, and without further info on how instantAcknowledgement works, all I can see from the given information is that it would make it harder, not easier, to discriminate between the true and false negative cases.
From my perspective I have an HTTP client whose requests time out, a MySQL instance that doesn't seem to be complaining, and a Java application that logs incoming requests but doesn't respond and drops a big chunk of the received data. The Traccar application should be aware of the pipeline being truncated, but the logs and the missing data in the database prove otherwise.

Considering that I've just calculated the shape of the error distribution without instantAcknowledgement, I can now start debugging with a different strategy. I can blindly enable instantAcknowledgement, but would you please tell me how it would help understand more about the problem and how to use it properly? Thanks

Jack3 months ago

My server has been running with instantAcknowledgement enabled for 2 days now.

As expected, I'm not getting timeouts on the osmand HTTP client side anymore, but without answers to my questions I have zero proof that this is a solution.

As soon as I have enough log lines produced with instantAcknowledgement enabled, I'll re-sample the request/response ratio (which I expect to be 1 now) and also the DB-stored ratio via the API, which will quantify the loss again.

Jack3 months ago

After testing a significant sample of points (>5k) with instantAcknowledgement enabled, I can assert that:

  • the HTTP response rate is now 100%
  • no false positives are present: all logged points are stored in the DB and can be retrieved via API call

I'd call the topic's problem solved, but my questions are not answered:

  • why doesn't the osmand protocol return the HTTP response immediately by default (why isn't instantAcknowledgement enabled by default)
  • why doesn't the osmand protocol return an HTTP response with a (calculated) probability of ~20% when instantAcknowledgement is disabled (why is instantAcknowledgement needed at all)

Anton Tananaev3 months ago

  • It's not enabled by default because waiting for the database write is what people usually want, in order to guarantee delivery.
  • I suspect the issue is that you're sending many devices through the same channel. It likely causes some issues with session tracking.

Jack3 months ago

Thanks for the answers.

I'd also prefer to get an ack of successful delivery on the client, instead of an instantAcknowledgement that doesn't prove successful message handling.

What puzzles me is the fact that instantAcknowledgement seems to be more than just a workaround, as it actually seems to fix the lossy ingestion of incoming osmand messages. Am I hitting a software limit on the osmand channel? I have a max of 9 points per second, but a mean of 1, which doesn't seem like a large number at all. If that's so, the problem lies somewhere before the write to the database.
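
If that's the case, a quick (and admittedly rough) way to probe the shared-channel hypothesis would be to push each device's fix on its own short-lived connection and compare the response rate. A minimal sketch, with placeholder URL and device ids, assuming the standard OsmAnd query parameters (id, lat, lon, timestamp):

import requests

OSMAND_URL = "http://localhost:5055/"   # placeholder osmand endpoint

fixes = [
    {"id": "device-001", "lat": 45.0701, "lon": 7.6869, "timestamp": 1723708800},
    {"id": "device-002", "lat": 45.0703, "lon": 7.6871, "timestamp": 1723708801},
]

for fix in fixes:
    # requests without a Session opens a fresh connection per call, and
    # "Connection: close" prevents reuse, so every device gets its own channel
    response = requests.post(OSMAND_URL, params=fix,
                             headers={"Connection": "close"}, timeout=30)
    print(fix["id"], response.status_code)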