View Issue Details
ID | Project | Category | View Status | Date Submitted | Last Update |
---|---|---|---|---|---|
0004024 | GNUnet | file-sharing service | public | 2015-10-26 01:34 | 2018-06-07 00:24 |
Reporter | Christian Grothoff | Assigned To | Christian Grothoff | ||
Priority | high | Severity | major | Reproducibility | sometimes |
Status | closed | Resolution | fixed | ||
Platform | i7 | OS | Debian GNU/Linux | OS Version | squeeze |
Product Version | Git master | ||||
Target Version | 0.11.0pre66 | Fixed in Version | 0.11.0pre66 | ||
Summary | 0004024: CPU utilization of FS on gnunet9 is too high | ||||
Description | Traces suggest it is in gnunet-service-fs_pe.c::get_latest() and merge_pr(), where we do a linear scan of a linked list (a simplified sketch of this scan follows the table below). The issue is that the number of entries there should be bounded by the number of connected peers, but is not. Somehow, peers can create a 'struct GSF_PendingRequest', which creates a plan entry (struct GSF_PendingRequestPlanBijection); then the request is cleaned up (GSF_pending_request_cancel_), but without 'full_cleanup'. Somehow, the PR and the BI survive. The next time the peer issues the same request, the check in gnunet-service-fs_cp.c::test_exist_cb() fails to detect that we still have a matching PR (well, CP is right: it cancelled it, after all). The new PR is then added to the same PE, so now we have two BI/PRs for the same query and peer. This can happen a lot, so we end up with a long DLL of BIs and excessive CPU consumption. | ||||
Steps To Reproduce | Run gnunet9 peer for a while. | ||||
Additional Information |
(gdb) print *rp->pe_head->next_PE->next_PE->next_PE->next_PE->next_PE->next_PE->next_PE->next_PE->next_PE->next_PE->pr
$26 = {public_data = {query = {bits = {3431383598, 4291201156, 2335342679, 3877743069, 3061985059, 3582666198, 612382203, 951268149, 1161128333, 1040336393, 1794210090, 2831188678, 3075402478, 1938878346, 3024631312, 693392574}}, target = 0x0, pr_head = 0x3c9e000, pr_tail = 0x250dc50, ttl = {abs_value_us = 1445551114366873}, start_time = {abs_value_us = 1445815582366873}, anonymity_level = 1, priority = 0, original_priority = 0, num_transmissions = 4, respect_offered = 0, options = GSF_PRO_DEFAULTS, type = GNUNET_BLOCK_TYPE_FS_DBLOCK, results_found = 0, has_started = 1}, rh = 0x0, rh_cls = 0x1bb1fd0, replies_seen = 0x0, bf = 0x0, hnode = 0x5c94d10, qe = 0x0, gh = 0x0, cadet_request = 0x0, llc_cont = 0x0, llc_cont_cls = 0x0, local_result = GNUNET_BLOCK_EVALUATION_OK_MORE, sender_pid = 0, origin_pid = 24, qe_start = {abs_value_us = 1445815582366947}, warn_task = 0x0, local_result_offset = 9215370345274876929, first_uid = 0, cadet_retry_count = 0, replies_seen_count = 0, replies_seen_size = 0, mingle = 0, have_first_uid = 1}
(gdb) print *rp->pe_head->next_PE->next_PE->next_PE->next_PE->pr
$27 = {public_data = {query = {bits = {3431383598, 4291201156, 2335342679, 3877743069, 3061985059, 3582666198, 612382203, 951268149, 1161128333, 1040336393, 1794210090, 2831188678, 3075402478, 1938878346, 3024631312, 693392574}}, target = 0x0, pr_head = 0x72e0340, pr_tail = 0x889dc30, ttl = {abs_value_us = 1445548631500674}, start_time = {abs_value_us = 1445815473500674}, anonymity_level = 1, priority = 0, original_priority = 0, num_transmissions = 5, respect_offered = 0, options = GSF_PRO_DEFAULTS, type = GNUNET_BLOCK_TYPE_FS_DBLOCK, results_found = 0, has_started = 1}, rh = 0x0, rh_cls = 0x226c430, replies_seen = 0x0, bf = 0x0, hnode = 0x7565ec0, qe = 0x0, gh = 0x0, cadet_request = 0x0, llc_cont = 0x0, llc_cont_cls = 0x0, local_result = GNUNET_BLOCK_EVALUATION_OK_MORE, sender_pid = 0, origin_pid = 24, qe_start = {abs_value_us = 1445815473500690}, warn_task = 0x0, local_result_offset = 2230422963627760641, first_uid = 0, cadet_retry_count = 0, replies_seen_count = 0, replies_seen_size = 0, mingle = 0, have_first_uid = 1}
(gdb) print *rp->pe_head->next_PE->next_PE->next_PE->next_PE->next_PE->next_PE->next_PE->next_PE->next_PE->next_PE->pr
$16 = {public_data = {query = {bits = {3431383598, 4291201156, 2335342679, 3877743069, 3061985059, 3582666198, 612382203, 951268149, 1161128333, 1040336393, 1794210090, 2831188678, 3075402478, 1938878346, 3024631312, 693392574}}, target = 0x0, pr_head = 0x3c9e000, pr_tail = 0x250dc50, ttl = {abs_value_us = 1445551114366873}, start_time = {abs_value_us = 1445815582366873}, anonymity_level = 1, priority = 0, original_priority = 0, num_transmissions = 4, respect_offered = 0, options = GSF_PRO_DEFAULTS, type = GNUNET_BLOCK_TYPE_FS_DBLOCK, results_found = 0, has_started = 1}, rh = 0x0, rh_cls = 0x1bb1fd0, replies_seen = 0x0, bf = 0x0, hnode = 0x5c94d10, qe = 0x0, gh = 0x0, cadet_request = 0x0, llc_cont = 0x0, llc_cont_cls = 0x0, local_result = GNUNET_BLOCK_EVALUATION_OK_MORE, sender_pid = 0, origin_pid = 24, qe_start = {abs_value_us = 1445815582366947}, warn_task = 0x0, local_result_offset = 9215370345274876929, first_uid = 0, cadet_retry_count = 0, replies_seen_count = 0, replies_seen_size = 0, mingle = 0, have_first_uid = 1}
(gdb) print $10
$17 = {public_data = {query = {bits = {3431383598, 4291201156, 2335342679, 3877743069, 3061985059, 3582666198, 612382203, 951268149, 1161128333, 1040336393, 1794210090, 2831188678, 3075402478, 1938878346, 3024631312, 693392574}}, target = 0x0, pr_head = 0x72e0340, pr_tail = 0x889dc30, ttl = {abs_value_us = 1445548631500674}, start_time = {abs_value_us = 1445815473500674}, anonymity_level = 1, priority = 0, original_priority = 0, num_transmissions = 5, respect_offered = 0, options = GSF_PRO_DEFAULTS, type = GNUNET_BLOCK_TYPE_FS_DBLOCK, results_found = 0, has_started = 1}, rh = 0x0, rh_cls = 0x226c430, replies_seen = 0x0, bf = 0x0, hnode = 0x7565ec0, qe = 0x0, gh = 0x0, cadet_request = 0x0, llc_cont = 0x0, llc_cont_cls = 0x0, local_result = GNUNET_BLOCK_EVALUATION_OK_MORE, sender_pid = 0, origin_pid = 24, qe_start = {abs_value_us = 1445815473500690}, warn_task = 0x0, local_result_offset = 2230422963627760641, first_uid = 0, cadet_retry_count = 0, replies_seen_count = 0, replies_seen_size = 0, mingle = 0, have_first_uid = 1}
Tags | No tags attached. | ||||
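
To make the failure mode described above concrete, here is a minimal, self-contained sketch (not the actual GNUnet sources) of the kind of linear scan get_latest() performs over the BI list. The struct and field names loosely mirror those visible in the gdb dump above; everything else is illustrative.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins for the structures seen in the gdb dump
 * (illustrative only; the real definitions live in the FS service). */
struct PendingRequest
{
  uint64_t start_time_us;            /* when the request was started */
};

struct PlanBijection                 /* cf. struct GSF_PendingRequestPlanBijection */
{
  struct PlanBijection *next_PE;     /* DLL of BIs attached to one plan entry */
  struct PendingRequest *pr;         /* the pending request behind this BI */
};

struct RequestPlan
{
  struct PlanBijection *pe_head;     /* head of the BI list */
};

/* Linear scan over the BI list: O(n) per call.  Harmless while n is
 * bounded by the number of connected peers, but once duplicate BIs for
 * the same query and peer accumulate, every call re-walks an ever-growing
 * list and CPU usage explodes. */
static struct PendingRequest *
get_latest (const struct RequestPlan *rp)
{
  struct PendingRequest *latest = NULL;

  for (struct PlanBijection *bi = rp->pe_head; NULL != bi; bi = bi->next_PE)
    if ( (NULL == latest) ||
         (bi->pr->start_time_us > latest->start_time_us) )
      latest = bi->pr;
  return latest;
}
```

With duplicate BIs for the same query and peer, the list length in this scan is no longer bounded by the number of connected peers, which matches the CPU profile reported above.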
Note 0009787 | Christian Grothoff | 2015-10-26 01:44
I just noticed that the "pr_head" of the respective PR is non-NULL. So the PR *knows* that it is still associated with BIs. But 'pr->rh' is only set to NULL just before we call GSF_plan_notify_request_done(), where we have a 'while (NULL != prd->pr_head)' loop, which cannot exit until all entries are removed. So either I'm missing a case where 'rh' is NULL'ed (grep strongly disagrees), or the PEs are created just *after* GSF_pending_request_cancel_ (pr, GNUNET_NO) is called!
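
To spell out the reasoning in this note, here is a simplified sketch (assumed names and layout, not the actual GSF_pending_request_cancel_/GSF_plan_notify_request_done code) of the cleanup ordering: 'rh' is cleared only immediately before the plan entries are drained, and the drain loop cannot return while any BI is still linked, so a PR with rh == 0x0 but pr_head != 0x0 (as in the dump) can only mean new BIs were attached after cancellation.

```c
#include <stdlib.h>

/* Illustrative structures; not the GNUnet definitions. */
struct PlanBijection
{
  struct PlanBijection *next_PR;     /* DLL of BIs referencing one PR */
};

struct PendingRequestData            /* cf. the pr_head/pr_tail/rh fields in the dump */
{
  struct PlanBijection *pr_head;
  struct PlanBijection *pr_tail;
  void *rh;                          /* non-NULL only while the request handle lives */
};

/* Unlink and free a single BI from the PR's list (simplified: the real BI
 * also sits in a per-plan-entry list, hence "bijection"). */
static void
plan_entry_destroy (struct PendingRequestData *prd)
{
  struct PlanBijection *bi = prd->pr_head;

  prd->pr_head = bi->next_PR;
  if (NULL == prd->pr_head)
    prd->pr_tail = NULL;
  free (bi);
}

/* Cannot return while any BI is still linked to the PR. */
static void
plan_notify_request_done (struct PendingRequestData *prd)
{
  while (NULL != prd->pr_head)
    plan_entry_destroy (prd);
}

/* 'rh' is cleared only right before the BI list is drained. */
static void
pending_request_cancel (struct PendingRequestData *prd)
{
  prd->rh = NULL;
  plan_notify_request_done (prd);    /* afterwards pr_head must be NULL */
}
```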
Note 0009788 | Christian Grothoff | 2015-10-26 01:53
Aha! When a new peer connected, we were going over the 'pending' requests to see if any of those might be good to send to the new connection. In this case, we might take an 'inactive' but pending request and add it to the new peer. Badness results. Fixed in SVN 36578, together with a bunch of fresh assertions.
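
A hedged sketch of the kind of guard and assertion this note describes (illustrative names; not the literal r36578 patch): when a new peer connects, skip pending requests that have already been cancelled instead of creating a plan entry for them.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative types; not the GNUnet definitions. */
struct ConnectedPeer;

struct PendingRequest
{
  void *rh;          /* request handle: non-NULL only while still active */
};

/* Hypothetical helper standing in for the code that creates a plan entry
 * (a GSF_PendingRequestPlanBijection) for a (peer, request) pair. */
static void
plan_request_for_peer (struct ConnectedPeer *cp, struct PendingRequest *pr)
{
  /* one of the "fresh assertions": never plan an already-cancelled request */
  assert (NULL != pr->rh);
  (void) cp;
  /* ... create and link the BI here ... */
}

/* Called for each pending request when a new peer connects: skip requests
 * that were already cancelled instead of re-planning them. */
static void
consider_request_for_new_peer (struct ConnectedPeer *cp,
                               struct PendingRequest *pr)
{
  if (NULL == pr->rh)
    return;          /* inactive but still-listed request: do not add it */
  plan_request_for_peer (cp, pr);
}
```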
Note 0009789 | Christian Grothoff | 2015-10-26 11:36
Can confirm this is fixed.
Date Modified | Username | Field | Change |
---|---|---|---|
2015-10-26 01:34 | Christian Grothoff | New Issue | |
2015-10-26 01:34 | Christian Grothoff | Status | new => assigned |
2015-10-26 01:34 | Christian Grothoff | Assigned To | => Christian Grothoff |
2015-10-26 01:44 | Christian Grothoff | Note Added: 0009787 | |
2015-10-26 01:53 | Christian Grothoff | Note Added: 0009788 | |
2015-10-26 01:55 | Christian Grothoff | Status | assigned => feedback |
2015-10-26 11:36 | Christian Grothoff | Note Added: 0009789 | |
2015-10-26 11:36 | Christian Grothoff | Status | feedback => assigned |
2015-10-26 11:36 | Christian Grothoff | Status | assigned => resolved |
2015-10-26 11:36 | Christian Grothoff | Fixed in Version | => 0.11.0pre66 |
2015-10-26 11:36 | Christian Grothoff | Resolution | open => fixed |
2018-06-07 00:24 | Christian Grothoff | Status | resolved => closed |