View Issue Details
|ID||Project||Category||View Status||Date Submitted||Last Update|
|0006488||Taler||exchange||public||2020-08-18 08:56||2020-10-03 14:09|
|Reporter||Florian Dold||Assigned To||Christian Grothoff|
|Priority||normal||Severity||minor||Reproducibility||have not tried|
|Target Version||0.8||Fixed in Version||0.8|
|Summary||0006488: wallet's test-payment-multiple still fails with serialization errors (deposit+refresh conflict)|
|Description||Can be reproduced by running the wallet integration test|
It seems like this time, the serialization errors are caused due to concurrent refreshes (from the wallet) and deposits (from the merchant).
Should the wallet wait with the refreshing until it got a response from the merchant?
There's also a slightly weird log line (minus one transactions? Is that because of an error code?):
Aug 18 12:07:50-289110 taler-exchange-httpd-112449(MEKW4Q4DZP774XW0J427AMACER) INFO Coin E9PY5V8A yielded -1 transactions of type get_refresh_session_by_coin
|Tags||No tags attached.|
exchange-httpd-testexchange-1-stderr.log (722,147 bytes)
||There should only ever be one operation per coin, regardless of whether it is a deposit (with merchant) or refresh (with exchange), as they obviously do conflict. So refreshing while depositing increases the chances of the merchant deposit failing. So definitively delay any refresh until you have a response (or hard timeout) from the merchant.|
I actually misread the exchange's log: There is *no* refresh involved. I've verified this by looking at the exchange's DB and by adding more logging to the wallet.
The wallet now supports sequentializing requests that somehow involve coins or reserves respectively. So making a request to the merchant's /pay endpoint will first acquire a lock on the "exchange-coins" resource, and the refresh operations in the wallet will do the same. (These "locks" are rather coarse-grained, as one /pay might involve multiple exchanges in the future. Also, the wallet still has no threads, so these are async-based locks.)
However, this doesn't solve the problem. I still get the same serialization conflicts, even though there are no parallel requests to the exchange made by the wallet.
Indeed the requests made by the merchant seem to be made concurrently: https://git.taler.net/merchant.git/tree/src/backend/taler-merchant-httpd_post-orders-ID-pay.c#n653
But how can this cause problems when these requests are for different coins?!
Also note that even on my machine the test case doesn't always fail.
However, after the improvements in the wallet (sequentializing direct and indirect requests to the exchange that may conflict), this test case simply ought not fail anymore!
||As long as the exchange database is _tiny_, the page-level locks by Postgres are coarse and will result in "false positives". That's what I studied in my long experiments before: the serialization conflicts go down once the database grows. As the test likely runs against a virgin and thus tiny database, serialization errors are more prevalent than what will happen in production.|
||Possible solutions: more re-tries by the exchange, more re-tries by the merchant, or by the wallet.|
Can we introduce some configuration parameter to control the thread pool size of the exchange (and later the number of DB handles when we have n:m threads to DB handles)? It's a configuration parameter that probably shouldn't be hard-coded in the first place!
(I can do this myself, just checking if there's a reason that it's hard-coded right now.)
Even if the wallet or the merchant re-tries, it unnecessarily complicates the test cases (esp. when fault injection is also involved), because the test harness has to additionally handle non-deterministic failures in the exchange. The exchange already does a pretty high number of retries.
It's hard-coded for laziness. I'm not sure I'd make it a _configuration_ parameter, what about using a command-line option?
If not given, we should (on Linux) use CPU_COUNT() on the result of sched_getaffinity() to auto-determine the number of CPU cores available.
||Makes sense. Implemented here: https://git.taler.net/exchange.git/commit/?id=1cd3f3281b19e2b739eca02a4b21f756e962f78c|
||Test case now passes reliably.|
|2020-08-18 08:56||Florian Dold||New Issue|
|2020-08-18 08:56||Florian Dold||Status||new => assigned|
|2020-08-18 08:56||Florian Dold||Assigned To||=> Christian Grothoff|
|2020-08-18 08:56||Florian Dold||File Added: exchange-httpd-testexchange-1-stderr.log|
|2020-08-18 10:29||Christian Grothoff||Note Added: 0016646|
|2020-08-18 10:31||Christian Grothoff||Assigned To||Christian Grothoff => Florian Dold|
|2020-08-18 10:31||Christian Grothoff||Target Version||=> 0.8|
|2020-08-18 10:31||Christian Grothoff||Description Updated||View Revisions|
|2020-08-18 15:12||Florian Dold||Note Added: 0016648|
|2020-08-18 15:13||Florian Dold||Assigned To||Florian Dold => Christian Grothoff|
|2020-08-18 15:17||Florian Dold||Note Added: 0016649|
|2020-08-18 15:28||Christian Grothoff||Note Added: 0016650|
|2020-08-18 15:29||Christian Grothoff||Note Added: 0016651|
|2020-08-18 15:40||Florian Dold||Note Added: 0016652|
|2020-08-18 15:44||Christian Grothoff||Note Added: 0016653|
|2020-08-18 16:15||Florian Dold||Note Added: 0016654|
|2020-08-18 16:16||Florian Dold||Status||assigned => resolved|
|2020-08-18 16:16||Florian Dold||Resolution||open => fixed|
|2020-08-18 16:16||Florian Dold||Note Added: 0016655|
|2020-10-03 14:09||Christian Grothoff||Fixed in Version||=> 0.8|