View Issue Details

ID: 0002032
Project: GNUnet
Category: file-sharing service
View Status: public
Last Update: 2011-12-26 22:30
Reporter: LRN
Assigned To: LRN
Priority: low
Severity: feature
Reproducibility: N/A
Status: closed
Resolution: fixed
Product Version: 0.9.0
Target Version: 0.9.1
Fixed in Version: 0.9.1
Summary: 0002032: Extract keywords from file names
Description:

Two methods (working independently):
1) Tokenize the filename using " ", "_" and "." as delimiters and use the tokens as keywords (this includes the file extension). For example, This_File was_made_as.an.example.to.illustrate.the.point.tar.gz will yield the keywords "This", "File", "was", "made", "as", "an", "example", "to", "illustrate", "the", "point", "tar" and "gz" (a minimal standalone sketch of this tokenization is given after item (2) below).

1a) Also use combinations of consecutive tokens (without changing their order) as keywords. For example, to the above we would add "This_File", "This_File was", "This_File was_made", ..., "File was", "File was_made", ..., "was_made", "was_made_as", "was_made_as.an", etc. That is a LOT of keywords, though (on the other hand, this example filename is quite long in itself; also, filename length is usually capped by the filesystem, so it won't get REALLY bad).

2) Parse the filename, find tokens enclosed in matching parentheses ({}, [] and ()), and use them as keywords. In case of nested parentheses, only the inner pair counts. For example, "This.is.my.[boomstick]" will yield "boomstick" as a keyword (independently of anything (1) would do).
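
To make the intended behaviour concrete, here is a minimal standalone C sketch of the method (1) tokenization. It is an illustration only, not the attached patch (which implements both methods inside src/fs/fs_uri.c):

#include <stdio.h>
#include <string.h>

/* Method (1) illustration: split a filename at ' ', '_' and '.' and print
   every non-empty token as a would-be keyword.  strtok() skips runs of
   consecutive delimiters, so zero-length tokens never appear. */
int
main (void)
{
  char fn[] = "This_File was_made_as.an.example.to.illustrate.the.point.tar.gz";
  char *tok;

  for (tok = strtok (fn, "_. "); NULL != tok; tok = strtok (NULL, "_. "))
    printf ("keyword: %s\n", tok);
  return 0;
}

Method (2) can be sketched the same way by scanning for "[{(" and "]})" with strpbrk(), which is what the attached patch does in get_keywords_from_parens().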
Additional Information:

I've attached a patch that implements (1) and (2).
Sadly, the patch is NOT UTF-8 aware.
I'm not sure about the appropriateness of the place where I put the extraction calls, but they can be moved (most of the work is done by separate functions).
Tags: No tags attached.
Attached Files
0001-Extract-keywords-from-filenames.patch (7,732 bytes)   
From e6b4ff4b126f3f3e3c1a89eb5afcc15408f1b73f Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?=D0=A0=D1=83=D1=81=D0=BB=D0=B0=D0=BD=20=D0=98=D0=B6=D0=B1=D1?=
 =?UTF-8?q?=83=D0=BB=D0=B0=D1=82=D0=BE=D0=B2?= <lrn1986@gmail.com>
Date: Sat, 24 Dec 2011 00:31:37 +0400
Subject: [PATCH] Extract keywords from filenames

---
 src/fs/fs_uri.c |  189 ++++++++++++++++++++++++++++++++++++++++++++++++++++---
 1 files changed, 179 insertions(+), 10 deletions(-)

diff --git a/src/fs/fs_uri.c b/src/fs/fs_uri.c
index 62fd513..7ca41ef 100644
--- a/src/fs/fs_uri.c
+++ b/src/fs/fs_uri.c
@@ -1577,6 +1577,156 @@ GNUNET_FS_uri_test_loc (const struct GNUNET_FS_Uri *uri)
   return uri->type == loc;
 }
 
+static int
+insert_non_mandatory_keyword (const char *s, char **array, int index)
+{
+  char *nkword;
+  GNUNET_asprintf (&nkword, " %s", /* space to mark as 'non mandatory' */ s);
+  array[index] = nkword;
+  return 1;
+}
+
+static int
+find_duplicate (const char *s, const char **array, int index)
+{
+  int j;
+  for (j = index - 1; j >= 0; j--)
+    if (0 == strcmp (&array[j][1], s))
+      return GNUNET_YES;
+  return GNUNET_NO;
+}
+
+/**
+ * Break the filename up by matching [], () and {} pairs to make
+ * keywords. In case of nesting parentheses only the inner pair counts.
+ * You can't escape parentheses to scan something like "[blah\{foo]" to
+ * make a "blah{foo" keyword, this function is only a heuristic!
+ *
+ * @param s string to break down.
+ * @param array array to fill with enclosed tokens. If NULL, then tokens
+ *        are only counted.
+ * @param index index at which to start filling the array (entries prior
+ *        to it are used to check for duplicates). ignored if array == NULL.
+ * @return number of tokens counted (including duplicates), or number of
+ *         tokens extracted (excluding duplicates). 0 if there are no
+ *         matching parens in the string (when counting), or when all tokens 
+ *         were duplicates (when extracting).
+ */
+static int
+get_keywords_from_parens (char *s, char **array, int index)
+{
+  int count = 0;
+  char *open_paren, *close_paren, *ss, tmp;
+  if (NULL == s)
+    return 0;
+  if (NULL != array)
+    ss = GNUNET_strdup (s);
+  else
+    ss = s;
+  for (close_paren = ss - 1; NULL != (open_paren = strpbrk (close_paren + 1, "[{("));)
+  {
+    int match = 0;
+    close_paren = strpbrk (open_paren + 1, "]})");
+    if (NULL == close_paren)
+      break;
+    switch (open_paren[0])
+    {
+    case '[':
+      if (']' == close_paren[0])
+        match = 1;
+      break;
+    case '{':
+      if ('}' == close_paren[0])
+        match = 1;
+      break;
+    case '(':
+      if (')' == close_paren[0])
+        match = 1;
+      break;
+    default:
+      break;
+    }
+    if (match && (close_paren - open_paren > 1))
+    {
+      if (NULL != array)
+      {
+        tmp = close_paren[0];
+        close_paren[0] = '\0';
+        if (GNUNET_NO == find_duplicate ((const char *) &open_paren[1], (const char **) array, index + count))
+        {
+          count += insert_non_mandatory_keyword ((const char *) &open_paren[1], array,
+              index + count);
+        }
+        close_paren[0] = tmp;
+      }
+      else
+        count += 1;
+    }
+  }
+  if (NULL != array)
+    GNUNET_free (ss);
+  return count;
+}
+
+/**
+ * Break the filename up by "_", " " and "." (any other separators?) to make
+ * keywords.
+ *
+ * @param s string to break down.
+ * @param array array to fill with tokens. If NULL, then tokens are only
+ *        counted.
+ * @param index index at which to start filling the array (entries prior
+ *        to it are used to check for duplicates). ignored if array == NULL.
+ * @return number of tokens (>1) counted (including duplicates), or number of
+ *         tokens extracted (excluding duplicates). 0 if there are no
+ *         separators in the string (when counting), or when all tokens were
+ *         duplicates (when extracting).
+ */
+static int
+get_keywords_from_tokens (char *s, char **array, int index)
+{
+  char *p, *p_prev, *ss, tmp;
+  int seps = 0;
+  if (NULL != array)
+    ss = GNUNET_strdup (s);
+  else
+    ss = s;
+  p_prev = p = ss;
+  for (p_prev = p = ss; NULL != (p = strpbrk (p, "_. ")); p_prev = p = p + 1)
+  {
+    /* don't count 0-length tokens */
+    if (p - p_prev == 0)
+      continue;
+    if (NULL != array)
+    {
+      tmp = p[0];
+      p[0] = '\0';
+      if (GNUNET_NO == find_duplicate ((const char *) p_prev, (const char **) array, index + seps))
+      {
+        seps += insert_non_mandatory_keyword ((const char *) p_prev, array,
+              index + seps);
+      }
+      p[0] = tmp;
+    }
+    else
+      seps += 1;
+  }
+  if (NULL != array)
+  {
+    if (seps > 0 && p_prev != NULL && strlen (p_prev)
+        && !find_duplicate ((const char *) p_prev, (const char **) array,
+        index + seps))
+    {
+      seps += insert_non_mandatory_keyword ((const char *) p_prev, array,
+          index + seps);
+    }
+    GNUNET_free (ss);
+  }
+  else if (seps > 0)
+    /* Turn it into the number of keywords (1 separator == 2 keywords) */
+    seps += 1;
+  return seps;
+}
 
 /**
  * Function called on each value in the meta data.
@@ -1601,18 +1751,14 @@ gather_uri_data (void *cls, const char *plugin_name,
                  const char *data_mime_type, const char *data, size_t data_len)
 {
   struct GNUNET_FS_Uri *uri = cls;
-  char *nkword;
-  int j;
 
   if ((format != EXTRACTOR_METAFORMAT_UTF8) &&
       (format != EXTRACTOR_METAFORMAT_C_STRING))
     return 0;
-  for (j = uri->data.ksk.keywordCount - 1; j >= 0; j--)
-    if (0 == strcmp (&uri->data.ksk.keywords[j][1], data))
-      return GNUNET_OK;
-  GNUNET_asprintf (&nkword, " %s",      /* space to mark as 'non mandatory' */
-                   data);
-  uri->data.ksk.keywords[uri->data.ksk.keywordCount++] = nkword;
+  if (find_duplicate (data, (const char **) uri->data.ksk.keywords, uri->data.ksk.keywordCount))
+    return GNUNET_OK;
+  uri->data.ksk.keywordCount += insert_non_mandatory_keyword (data,
+      uri->data.ksk.keywords, uri->data.ksk.keywordCount);
   return 0;
 }
 
@@ -1630,7 +1776,9 @@ GNUNET_FS_uri_ksk_create_from_meta_data (const struct GNUNET_CONTAINER_MetaData
                                          *md)
 {
   struct GNUNET_FS_Uri *ret;
-  int ent;
+  char *filename, *full_name;
+  char *ss;
+  int ent, tok_keywords = 0, paren_keywords = 0;
 
   if (md == NULL)
     return NULL;
@@ -1639,9 +1787,30 @@ GNUNET_FS_uri_ksk_create_from_meta_data (const struct GNUNET_CONTAINER_MetaData
   ent = GNUNET_CONTAINER_meta_data_iterate (md, NULL, NULL);
   if (ent > 0)
   {
-    ret->data.ksk.keywords = GNUNET_malloc (sizeof (char *) * ent);
+    full_name = GNUNET_CONTAINER_meta_data_get_first_by_types (md,
+        EXTRACTOR_METATYPE_FILENAME, -1);
+    if (NULL != full_name)
+    {
+      filename = full_name;
+      while (NULL != (ss = strstr (filename, DIR_SEPARATOR_STR)))
+        filename = ss + 1;
+      tok_keywords = get_keywords_from_tokens (filename, NULL, 0);
+      paren_keywords = get_keywords_from_parens (filename, NULL, 0);
+    }
+    ret->data.ksk.keywords = GNUNET_malloc (sizeof (char *) * (ent
+        + tok_keywords + paren_keywords));
     GNUNET_CONTAINER_meta_data_iterate (md, &gather_uri_data, ret);
   }
+  if (tok_keywords > 0)
+    ret->data.ksk.keywordCount += get_keywords_from_tokens (filename,
+        ret->data.ksk.keywords,
+        ret->data.ksk.keywordCount);
+  if (paren_keywords > 0)
+    ret->data.ksk.keywordCount += get_keywords_from_parens (filename,
+        ret->data.ksk.keywords,
+        ret->data.ksk.keywordCount);
+  if (ent > 0)
+    GNUNET_free (full_name);
   return ret;
 }
 
-- 
1.7.4
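
For reference, a hypothetical caller-side sketch (not part of the attached patch) of how the new extraction path would be exercised through the public FS API. The include paths, the "<attacher>" plugin name, the MIME type and the exact keyword output are assumptions, not something taken from the patch:

#include <gnunet/gnunet_util_lib.h>
#include <gnunet/gnunet_fs_service.h>
#include <extractor.h>
#include <stdio.h>
#include <string.h>

/* Build meta data that carries only a FILENAME entry and let
   GNUNET_FS_uri_ksk_create_from_meta_data () derive keywords from it
   (the whole value, its tokens and bracketed substrings, as per the
   patch above). */
int
main (void)
{
  struct GNUNET_CONTAINER_MetaData *md;
  struct GNUNET_FS_Uri *uri;
  char *s;
  const char *fn = "This.is.my.[boomstick].tar.gz";

  md = GNUNET_CONTAINER_meta_data_create ();
  GNUNET_CONTAINER_meta_data_insert (md, "<attacher>",
                                     EXTRACTOR_METATYPE_FILENAME,
                                     EXTRACTOR_METAFORMAT_UTF8,
                                     "text/plain", fn, strlen (fn) + 1);
  uri = GNUNET_FS_uri_ksk_create_from_meta_data (md);
  s = GNUNET_FS_uri_to_string (uri);
  /* Expect a gnunet://fs/ksk/ URI whose keywords include "This", "is",
     "my", "tar", "gz" and "boomstick", all marked non-mandatory. */
  printf ("%s\n", s);
  GNUNET_free (s);
  GNUNET_FS_uri_destroy (uri);
  GNUNET_CONTAINER_meta_data_destroy (md);
  return 0;
}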

Activities

Christian Grothoff (manager), 2011-12-24 17:05, note ~0005192:

Cleaned up version of the patch applied as SVN 18798.

Issue History

Date Modified Username Field Change
2011-12-23 22:28 LRN New Issue
2011-12-23 22:28 LRN File Added: 0001-Extract-keywords-from-filenames.patch
2011-12-24 17:05 Christian Grothoff Note Added: 0005192
2011-12-24 17:05 Christian Grothoff Status new => resolved
2011-12-24 17:05 Christian Grothoff Fixed in Version => 0.9.1
2011-12-24 17:05 Christian Grothoff Resolution open => fixed
2011-12-24 17:05 Christian Grothoff Assigned To => LRN
2011-12-25 17:49 Christian Grothoff Product Version => 0.9.0
2011-12-25 17:49 Christian Grothoff Target Version => 0.9.1
2011-12-26 22:30 Christian Grothoff Status resolved => closed