View Issue Details

IDProjectCategoryView StatusLast Update
0006488Talerexchangepublic2020-10-03 14:09
ReporterFlorian Dold Assigned ToChristian Grothoff  
PrioritynormalSeverityminorReproducibilityhave not tried
Status resolvedResolutionfixed 
Target Version0.8Fixed in Version0.8 
Summary0006488: wallet's test-payment-multiple still fails with serialization errors (deposit+refresh conflict)
DescriptionCan be reproduced by running the wallet integration test

  ./testrunner test-payment-multiple

It seems like this time, the serialization errors are caused due to concurrent refreshes (from the wallet) and deposits (from the merchant).

Should the wallet wait with the refreshing until it got a response from the merchant?

There's also a slightly weird log line (minus one transactions? Is that because of an error code?):
Aug 18 12:07:50-289110 taler-exchange-httpd-112449(MEKW4Q4DZP774XW0J427AMACER) INFO Coin E9PY5V8A yielded -1 transactions of type get_refresh_session_by_coin
TagsNo tags attached.

Activities

Florian Dold

2020-08-18 08:56

manager  

Christian Grothoff

2020-08-18 10:29

manager   ~0016646

There should only ever be one operation per coin, regardless of whether it is a deposit (with merchant) or refresh (with exchange), as they obviously do conflict. So refreshing while depositing increases the chances of the merchant deposit failing. So definitively delay any refresh until you have a response (or hard timeout) from the merchant.

Florian Dold

2020-08-18 15:12

manager   ~0016648

I actually misread the exchange's log: There is *no* refresh involved. I've verified this by looking at the exchange's DB and by adding more logging to the wallet.

The wallet now supports sequentializing requests that somehow involve coins or reserves respectively. So making a request to the merchant's /pay endpoint will first acquire a lock on the "exchange-coins" resource, and the refresh operations in the wallet will do the same. (These "locks" are rather coarse-grained, as one /pay might involve multiple exchanges in the future. Also, the wallet still has no threads, so these are async-based locks.)

However, this doesn't solve the problem. I still get the same serialization conflicts, even though there are no parallel requests to the exchange made by the wallet.

Indeed the requests made by the merchant seem to be made concurrently: https://git.taler.net/merchant.git/tree/src/backend/taler-merchant-httpd_post-orders-ID-pay.c#n653

But how can this cause problems when these requests are for different coins?!

Florian Dold

2020-08-18 15:17

manager   ~0016649

Also note that even on my machine the test case doesn't always fail.

However, after the improvements in the wallet (sequentializing direct and indirect requests to the exchange that may conflict), this test case simply ought not fail anymore!

Christian Grothoff

2020-08-18 15:28

manager   ~0016650

As long as the exchange database is _tiny_, the page-level locks by Postgres are coarse and will result in "false positives". That's what I studied in my long experiments before: the serialization conflicts go down once the database grows. As the test likely runs against a virgin and thus tiny database, serialization errors are more prevalent than what will happen in production.

Christian Grothoff

2020-08-18 15:29

manager   ~0016651

Possible solutions: more re-tries by the exchange, more re-tries by the merchant, or by the wallet.

Florian Dold

2020-08-18 15:40

manager   ~0016652

Can we introduce some configuration parameter to control the thread pool size of the exchange (and later the number of DB handles when we have n:m threads to DB handles)? It's a configuration parameter that probably shouldn't be hard-coded in the first place!

(I can do this myself, just checking if there's a reason that it's hard-coded right now.)

Even if the wallet or the merchant re-tries, it unnecessarily complicates the test cases (esp. when fault injection is also involved), because the test harness has to additionally handle non-deterministic failures in the exchange. The exchange already does a pretty high number of retries.

Christian Grothoff

2020-08-18 15:44

manager   ~0016653

It's hard-coded for laziness. I'm not sure I'd make it a _configuration_ parameter, what about using a command-line option?
If not given, we should (on Linux) use CPU_COUNT() on the result of sched_getaffinity() to auto-determine the number of CPU cores available.

Florian Dold

2020-08-18 16:15

manager   ~0016654

Makes sense. Implemented here: https://git.taler.net/exchange.git/commit/?id=1cd3f3281b19e2b739eca02a4b21f756e962f78c

Florian Dold

2020-08-18 16:16

manager   ~0016655

Test case now passes reliably.

Issue History

Date Modified Username Field Change
2020-08-18 08:56 Florian Dold New Issue
2020-08-18 08:56 Florian Dold Status new => assigned
2020-08-18 08:56 Florian Dold Assigned To => Christian Grothoff
2020-08-18 08:56 Florian Dold File Added: exchange-httpd-testexchange-1-stderr.log
2020-08-18 10:29 Christian Grothoff Note Added: 0016646
2020-08-18 10:31 Christian Grothoff Assigned To Christian Grothoff => Florian Dold
2020-08-18 10:31 Christian Grothoff Target Version => 0.8
2020-08-18 10:31 Christian Grothoff Description Updated View Revisions
2020-08-18 15:12 Florian Dold Note Added: 0016648
2020-08-18 15:13 Florian Dold Assigned To Florian Dold => Christian Grothoff
2020-08-18 15:17 Florian Dold Note Added: 0016649
2020-08-18 15:28 Christian Grothoff Note Added: 0016650
2020-08-18 15:29 Christian Grothoff Note Added: 0016651
2020-08-18 15:40 Florian Dold Note Added: 0016652
2020-08-18 15:44 Christian Grothoff Note Added: 0016653
2020-08-18 16:15 Florian Dold Note Added: 0016654
2020-08-18 16:16 Florian Dold Status assigned => resolved
2020-08-18 16:16 Florian Dold Resolution open => fixed
2020-08-18 16:16 Florian Dold Note Added: 0016655
2020-10-03 14:09 Christian Grothoff Fixed in Version => 0.8