View Issue Details
ID | Project | Category | View Status | Date Submitted | Last Update |
---|---|---|---|---|---|
0004024 | GNUnet | file-sharing service | public | 2015-10-26 01:34 | 2018-06-07 00:24 |
Reporter | Christian Grothoff | Assigned To | Christian Grothoff | ||
Priority | high | Severity | major | Reproducibility | sometimes |
Status | closed | Resolution | fixed | ||
Platform | i7 | OS | Debian GNU/Linux | OS Version | squeeze |
Product Version | Git master | ||||
Target Version | 0.11.0pre66 | Fixed in Version | 0.11.0pre66 | ||
Summary | 0004024: CPU utilization of FS on gnunet9 is too high | ||||
Description | Traces suggest it is in gnunet-service-fs_pe.c::get_latest() and merge_pr(), where we do a linear scan of a linked list (a simplified sketch of this scan follows the table below). The issue is that the number of entries there should be bounded by the number of connected peers, but is not. Somehow, peers can create a 'struct GSF_PendingRequest', which creates a plan entry (struct GSF_PendingRequestPlanBijection); then the request is cleaned up (GSF_pending_request_cancel_), but without 'full_cleanup'. Somehow, the PR and the BI survive. The next time the peer issues the same request, the check in gnunet-service-fs_cp.c::test_exist_cb() fails to detect that we still have a matching PR (well, CP is right: it cancelled it, after all). The new PR is then added to the same PE, so now we have two BI/PRs for the same query and peer. This can happen a lot, so we end up with a long DLL of BIs and excessive CPU consumption. | ||||
Steps To Reproduce | Run gnunet9 peer for a while. | ||||
Additional Information |
(gdb) print *rp->pe_head->next_PE->next_PE->next_PE->next_PE->next_PE->next_PE->next_PE->next_PE->next_PE->next_PE->pr
$26 = {public_data = {query = {bits = {3431383598, 4291201156, 2335342679, 3877743069, 3061985059, 3582666198, 612382203, 951268149, 1161128333, 1040336393, 1794210090, 2831188678, 3075402478, 1938878346, 3024631312, 693392574}}, target = 0x0, pr_head = 0x3c9e000, pr_tail = 0x250dc50, ttl = {abs_value_us = 1445551114366873}, start_time = {abs_value_us = 1445815582366873}, anonymity_level = 1, priority = 0, original_priority = 0, num_transmissions = 4, respect_offered = 0, options = GSF_PRO_DEFAULTS, type = GNUNET_BLOCK_TYPE_FS_DBLOCK, results_found = 0, has_started = 1}, rh = 0x0, rh_cls = 0x1bb1fd0, replies_seen = 0x0, bf = 0x0, hnode = 0x5c94d10, qe = 0x0, gh = 0x0, cadet_request = 0x0, llc_cont = 0x0, llc_cont_cls = 0x0, local_result = GNUNET_BLOCK_EVALUATION_OK_MORE, sender_pid = 0, origin_pid = 24, qe_start = {abs_value_us = 1445815582366947}, warn_task = 0x0, local_result_offset = 9215370345274876929, first_uid = 0, cadet_retry_count = 0, replies_seen_count = 0, replies_seen_size = 0, mingle = 0, have_first_uid = 1}
(gdb) print *rp->pe_head->next_PE->next_PE->next_PE->next_PE->pr
$27 = {public_data = {query = {bits = {3431383598, 4291201156, 2335342679, 3877743069, 3061985059, 3582666198, 612382203, 951268149, 1161128333, 1040336393, 1794210090, 2831188678, 3075402478, 1938878346, 3024631312, 693392574}}, target = 0x0, pr_head = 0x72e0340, pr_tail = 0x889dc30, ttl = {abs_value_us = 1445548631500674}, start_time = {abs_value_us = 1445815473500674}, anonymity_level = 1, priority = 0, original_priority = 0, num_transmissions = 5, respect_offered = 0, options = GSF_PRO_DEFAULTS, type = GNUNET_BLOCK_TYPE_FS_DBLOCK, results_found = 0, has_started = 1}, rh = 0x0, rh_cls = 0x226c430, replies_seen = 0x0, bf = 0x0, hnode = 0x7565ec0, qe = 0x0, gh = 0x0, cadet_request = 0x0, llc_cont = 0x0, llc_cont_cls = 0x0, local_result = GNUNET_BLOCK_EVALUATION_OK_MORE, sender_pid = 0, origin_pid = 24, qe_start = {abs_value_us = 1445815473500690}, warn_task = 0x0, local_result_offset = 2230422963627760641, first_uid = 0, cadet_retry_count = 0, replies_seen_count = 0, replies_seen_size = 0, mingle = 0, have_first_uid = 1}
(gdb) print *rp->pe_head->next_PE->next_PE->next_PE->next_PE->next_PE->next_PE->next_PE->next_PE->next_PE->next_PE->pr
$16 = {public_data = {query = {bits = {3431383598, 4291201156, 2335342679, 3877743069, 3061985059, 3582666198, 612382203, 951268149, 1161128333, 1040336393, 1794210090, 2831188678, 3075402478, 1938878346, 3024631312, 693392574}}, target = 0x0, pr_head = 0x3c9e000, pr_tail = 0x250dc50, ttl = {abs_value_us = 1445551114366873}, start_time = {abs_value_us = 1445815582366873}, anonymity_level = 1, priority = 0, original_priority = 0, num_transmissions = 4, respect_offered = 0, options = GSF_PRO_DEFAULTS, type = GNUNET_BLOCK_TYPE_FS_DBLOCK, results_found = 0, has_started = 1}, rh = 0x0, rh_cls = 0x1bb1fd0, replies_seen = 0x0, bf = 0x0, hnode = 0x5c94d10, qe = 0x0, gh = 0x0, cadet_request = 0x0, llc_cont = 0x0, llc_cont_cls = 0x0, local_result = GNUNET_BLOCK_EVALUATION_OK_MORE, sender_pid = 0, origin_pid = 24, qe_start = {abs_value_us = 1445815582366947}, warn_task = 0x0, local_result_offset = 9215370345274876929, first_uid = 0, cadet_retry_count = 0, replies_seen_count = 0, replies_seen_size = 0, mingle = 0, have_first_uid = 1}
(gdb) print $10
$17 = {public_data = {query = {bits = {3431383598, 4291201156, 2335342679, 3877743069, 3061985059, 3582666198, 612382203, 951268149, 1161128333, 1040336393, 1794210090, 2831188678, 3075402478, 1938878346, 3024631312, 693392574}}, target = 0x0, pr_head = 0x72e0340, pr_tail = 0x889dc30, ttl = {abs_value_us = 1445548631500674}, start_time = {abs_value_us = 1445815473500674}, anonymity_level = 1, priority = 0, original_priority = 0, num_transmissions = 5, respect_offered = 0, options = GSF_PRO_DEFAULTS, type = GNUNET_BLOCK_TYPE_FS_DBLOCK, results_found = 0, has_started = 1}, rh = 0x0, rh_cls = 0x226c430, replies_seen = 0x0, bf = 0x0, hnode = 0x7565ec0, qe = 0x0, gh = 0x0, cadet_request = 0x0, llc_cont = 0x0, llc_cont_cls = 0x0, local_result = GNUNET_BLOCK_EVALUATION_OK_MORE, sender_pid = 0, origin_pid = 24, qe_start = {abs_value_us = 1445815473500690}, warn_task = 0x0, local_result_offset = 2230422963627760641, first_uid = 0, cadet_retry_count = 0, replies_seen_count = 0, replies_seen_size = 0, mingle = 0, have_first_uid = 1}
Tags | No tags attached. | ||||
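
To make the failure mode described above concrete, here is a minimal, self-contained sketch (not the actual GNUnet sources) of the kind of linear scan get_latest() performs over the BI list. The struct and field names loosely mirror those visible in the gdb dump above; everything else is illustrative.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins for the structures seen in the gdb dump
 * (illustrative only; the real definitions live in the FS service). */
struct PendingRequest
{
  uint64_t start_time_us;            /* when the request was started */
};

struct PlanBijection                 /* cf. struct GSF_PendingRequestPlanBijection */
{
  struct PlanBijection *next_PE;     /* DLL of BIs attached to one plan entry */
  struct PendingRequest *pr;         /* the pending request behind this BI */
};

struct RequestPlan
{
  struct PlanBijection *pe_head;     /* head of the BI list */
};

/* Linear scan over the BI list: O(n) per call.  Harmless while n is
 * bounded by the number of connected peers, but once duplicate BIs for
 * the same query and peer accumulate, every call re-walks an ever-growing
 * list and CPU usage explodes. */
static struct PendingRequest *
get_latest (const struct RequestPlan *rp)
{
  struct PendingRequest *latest = NULL;

  for (struct PlanBijection *bi = rp->pe_head; NULL != bi; bi = bi->next_PE)
    if ( (NULL == latest) ||
         (bi->pr->start_time_us > latest->start_time_us) )
      latest = bi->pr;
  return latest;
}
```

With duplicate BIs for the same query and peer, the list length in this scan is no longer bounded by the number of connected peers, which matches the CPU profile reported above.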
Note 0009787 | Christian Grothoff | 2015-10-26 01:44
I just noticed that the "pr_head" of the respective PR is non-NULL. So the PR *knows* that it is still associated with BIs. But 'pr->rh' is only set to NULL just before we call GSF_plan_notify_request_done(), where we have a 'while (NULL != prd->pr_head)' loop, which cannot exit until all entries are removed. So either I'm missing a case where 'rh' is NULL'ed (grep strongly disagrees), or the PEs are created just *after* GSF_pending_request_cancel_ (pr, GNUNET_NO) is called!
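
To spell out the reasoning in this note, here is a simplified sketch (assumed names and layout, not the actual GSF_pending_request_cancel_/GSF_plan_notify_request_done code) of the cleanup ordering: 'rh' is cleared only immediately before the plan entries are drained, and the drain loop cannot return while any BI is still linked, so a PR with rh == 0x0 but pr_head != 0x0 (as in the dump) can only mean new BIs were attached after cancellation.

```c
#include <stdlib.h>

/* Illustrative structures; not the GNUnet definitions. */
struct PlanBijection
{
  struct PlanBijection *next_PR;     /* DLL of BIs referencing one PR */
};

struct PendingRequestData            /* cf. the pr_head/pr_tail/rh fields in the dump */
{
  struct PlanBijection *pr_head;
  struct PlanBijection *pr_tail;
  void *rh;                          /* non-NULL only while the request handle lives */
};

/* Unlink and free a single BI from the PR's list (simplified: the real BI
 * also sits in a per-plan-entry list, hence "bijection"). */
static void
plan_entry_destroy (struct PendingRequestData *prd)
{
  struct PlanBijection *bi = prd->pr_head;

  prd->pr_head = bi->next_PR;
  if (NULL == prd->pr_head)
    prd->pr_tail = NULL;
  free (bi);
}

/* Cannot return while any BI is still linked to the PR. */
static void
plan_notify_request_done (struct PendingRequestData *prd)
{
  while (NULL != prd->pr_head)
    plan_entry_destroy (prd);
}

/* 'rh' is cleared only right before the BI list is drained. */
static void
pending_request_cancel (struct PendingRequestData *prd)
{
  prd->rh = NULL;
  plan_notify_request_done (prd);    /* afterwards pr_head must be NULL */
}
```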
Note 0009788 | Christian Grothoff | 2015-10-26 01:53
Aha! When a new peer connected, we were going over the 'pending' requests to see if any of those might be good to send to the new connection. In this case, we might take an 'inactive' but pending request and add it to the new peer. Badness results. Fixed in SVN 36578, together with a bunch of fresh assertions.
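
A hedged sketch of the kind of guard and assertion this note describes (illustrative names; not the literal r36578 patch): when a new peer connects, skip pending requests that have already been cancelled instead of creating a plan entry for them.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative types; not the GNUnet definitions. */
struct ConnectedPeer;

struct PendingRequest
{
  void *rh;          /* request handle: non-NULL only while still active */
};

/* Hypothetical helper standing in for the code that creates a plan entry
 * (a GSF_PendingRequestPlanBijection) for a (peer, request) pair. */
static void
plan_request_for_peer (struct ConnectedPeer *cp, struct PendingRequest *pr)
{
  /* one of the "fresh assertions": never plan an already-cancelled request */
  assert (NULL != pr->rh);
  (void) cp;
  /* ... create and link the BI here ... */
}

/* Called for each pending request when a new peer connects: skip requests
 * that were already cancelled instead of re-planning them. */
static void
consider_request_for_new_peer (struct ConnectedPeer *cp,
                               struct PendingRequest *pr)
{
  if (NULL == pr->rh)
    return;          /* inactive but still-listed request: do not add it */
  plan_request_for_peer (cp, pr);
}
```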
Note 0009789 | Christian Grothoff | 2015-10-26 11:36
Can confirm this is fixed.
Date Modified | Username | Field | Change |
---|---|---|---|
2015-10-26 01:34 | Christian Grothoff | New Issue | |
2015-10-26 01:34 | Christian Grothoff | Status | new => assigned |
2015-10-26 01:34 | Christian Grothoff | Assigned To | => Christian Grothoff |
2015-10-26 01:44 | Christian Grothoff | Note Added: 0009787 | |
2015-10-26 01:53 | Christian Grothoff | Note Added: 0009788 | |
2015-10-26 01:55 | Christian Grothoff | Status | assigned => feedback |
2015-10-26 11:36 | Christian Grothoff | Note Added: 0009789 | |
2015-10-26 11:36 | Christian Grothoff | Status | feedback => assigned |
2015-10-26 11:36 | Christian Grothoff | Status | assigned => resolved |
2015-10-26 11:36 | Christian Grothoff | Fixed in Version | => 0.11.0pre66 |
2015-10-26 11:36 | Christian Grothoff | Resolution | open => fixed |
2018-06-07 00:24 | Christian Grothoff | Status | resolved => closed |