0006488: wallet's test-payment-multiple still fails with serialization errors (deposit+refresh conflict)

ID	Project	Category	View Status	Date Submitted	Last Update

0006488	Taler	exchange	public	2020-08-18 08:56	2021-08-24 16:23

Reporter	Florian Dold	Assigned To	Christian Grothoff
Priority	normal	Severity	minor	Reproducibility	have not tried
Status	closed	Resolution	fixed
Target Version	0.8	Fixed in Version	0.8

Summary	0006488: wallet's test-payment-multiple still fails with serialization errors (deposit+refresh conflict)
Description	Can be reproduced by running the wallet integration test ./testrunner test-payment-multiple It seems like this time, the serialization errors are caused due to concurrent refreshes (from the wallet) and deposits (from the merchant). Should the wallet wait with the refreshing until it got a response from the merchant? There's also a slightly weird log line (minus one transactions? Is that because of an error code?): Aug 18 12:07:50-289110 taler-exchange-httpd-112449(MEKW4Q4DZP774XW0J427AMACER) INFO Coin E9PY5V8A yielded -1 transactions of type get_refresh_session_by_coin
Tags	No tags attached.
Attached Files	exchange-httpd-testexchange-1-stderr.log (722,147 bytes)

Christian Grothoff 2020-08-18 10:29 manager ~0016646	There should only ever be one operation per coin, regardless of whether it is a deposit (with merchant) or refresh (with exchange), as they obviously do conflict. So refreshing while depositing increases the chances of the merchant deposit failing. So definitively delay any refresh until you have a response (or hard timeout) from the merchant.

Florian Dold 2020-08-18 15:12 manager ~0016648	I actually misread the exchange's log: There is no refresh involved. I've verified this by looking at the exchange's DB and by adding more logging to the wallet. The wallet now supports sequentializing requests that somehow involve coins or reserves respectively. So making a request to the merchant's /pay endpoint will first acquire a lock on the "exchange-coins" resource, and the refresh operations in the wallet will do the same. (These "locks" are rather coarse-grained, as one /pay might involve multiple exchanges in the future. Also, the wallet still has no threads, so these are async-based locks.) However, this doesn't solve the problem. I still get the same serialization conflicts, even though there are no parallel requests to the exchange made by the wallet. Indeed the requests made by the merchant seem to be made concurrently: https://git.taler.net/merchant.git/tree/src/backend/taler-merchant-httpd_post-orders-ID-pay.c#n653 But how can this cause problems when these requests are for different coins?!

Florian Dold 2020-08-18 15:17 manager ~0016649	Also note that even on my machine the test case doesn't always fail. However, after the improvements in the wallet (sequentializing direct and indirect requests to the exchange that may conflict), this test case simply ought not fail anymore!

Christian Grothoff 2020-08-18 15:28 manager ~0016650	As long as the exchange database is _tiny_, the page-level locks by Postgres are coarse and will result in "false positives". That's what I studied in my long experiments before: the serialization conflicts go down once the database grows. As the test likely runs against a virgin and thus tiny database, serialization errors are more prevalent than what will happen in production.

Christian Grothoff 2020-08-18 15:29 manager ~0016651	Possible solutions: more re-tries by the exchange, more re-tries by the merchant, or by the wallet.

Florian Dold 2020-08-18 15:40 manager ~0016652	Can we introduce some configuration parameter to control the thread pool size of the exchange (and later the number of DB handles when we have n:m threads to DB handles)? It's a configuration parameter that probably shouldn't be hard-coded in the first place! (I can do this myself, just checking if there's a reason that it's hard-coded right now.) Even if the wallet or the merchant re-tries, it unnecessarily complicates the test cases (esp. when fault injection is also involved), because the test harness has to additionally handle non-deterministic failures in the exchange. The exchange already does a pretty high number of retries.

Christian Grothoff 2020-08-18 15:44 manager ~0016653	It's hard-coded for laziness. I'm not sure I'd make it a _configuration_ parameter, what about using a command-line option? If not given, we should (on Linux) use CPU_COUNT() on the result of sched_getaffinity() to auto-determine the number of CPU cores available.

Florian Dold 2020-08-18 16:15 manager ~0016654	Makes sense. Implemented here: https://git.taler.net/exchange.git/commit/?id=1cd3f3281b19e2b739eca02a4b21f756e962f78c

Florian Dold 2020-08-18 16:16 manager ~0016655	Test case now passes reliably.

Date Modified	Username	Field	Change
2020-08-18 08:56	Florian Dold	New Issue
2020-08-18 08:56	Florian Dold	Status	new => assigned
2020-08-18 08:56	Florian Dold	Assigned To	=> Christian Grothoff
2020-08-18 08:56	Florian Dold	File Added: exchange-httpd-testexchange-1-stderr.log
2020-08-18 10:29	Christian Grothoff	Note Added: 0016646
2020-08-18 10:31	Christian Grothoff	Assigned To	Christian Grothoff => Florian Dold
2020-08-18 10:31	Christian Grothoff	Target Version	=> 0.8
2020-08-18 10:31	Christian Grothoff	Description Updated
2020-08-18 15:12	Florian Dold	Note Added: 0016648
2020-08-18 15:13	Florian Dold	Assigned To	Florian Dold => Christian Grothoff
2020-08-18 15:17	Florian Dold	Note Added: 0016649
2020-08-18 15:28	Christian Grothoff	Note Added: 0016650
2020-08-18 15:29	Christian Grothoff	Note Added: 0016651
2020-08-18 15:40	Florian Dold	Note Added: 0016652
2020-08-18 15:44	Christian Grothoff	Note Added: 0016653
2020-08-18 16:15	Florian Dold	Note Added: 0016654
2020-08-18 16:16	Florian Dold	Status	assigned => resolved
2020-08-18 16:16	Florian Dold	Resolution	open => fixed
2020-08-18 16:16	Florian Dold	Note Added: 0016655
2020-10-03 14:09	Christian Grothoff	Fixed in Version	=> 0.8
2021-08-24 16:23	Christian Grothoff	Status	resolved => closed

View Issue Details

Activities

Issue History