0001946: add button for keyword normalization

ID	Project	Category	View Status	Date Submitted	Last Update

0001946	gnunet-gtk	gnunet-fs-gtk	public	2011-11-22 20:25	2011-12-26 22:30

Reporter	Christian Grothoff	Assigned To	LRN
Priority	low	Severity	feature	Reproducibility	N/A
Status	closed	Resolution	won't fix
Product Version	Git master
Target Version	0.9.1	Fixed in Version	0.9.1

Summary	0001946: add button for keyword normalization
Description	We should have a button to normalize keywords when searching/publishing files (in addition to the automatic normalization of keywords when publishing, which we might want to make optional).
Tags	No tags attached.

LRN 2011-12-16 08:36 developer ~0005104	What exactly does this normalization consist of? Converting to lowercase? Also, when searching, do we normalize keywords, replacing the originals, or do we add normalized keywords to the search query?

Christian Grothoff 2011-12-16 20:14 manager ~0005112	See 'GNUNET_FS_uri_ksk_canonicalize'. Convert to lower case, remove vowels and special characters. As for how to apply this when searching (or not) and all of those usability issues: that's exactly the primary reason why I've not pursued this issue further yet: I don't have a good answer at this time. I like the idea (but with measure: strong passwords should also be valid 'keywords', so it must be an option); how to best canonicalize/normalize and how to best reflect this in the UI -- good question.
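
For readers who want to see what the three steps named above (lower case, no vowels, no special characters) do to a keyword, here is a minimal sketch. It is not the code behind 'GNUNET_FS_uri_ksk_canonicalize'; the function name `sketch_canonicalize` and its buffer-based signature are assumptions made for illustration, and it handles ASCII only.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper, NOT the GNUnet implementation.
   Writes a canonical form of `in` into `out` (capacity `cap`, including
   the terminating NUL) and returns the number of characters written. */
size_t
sketch_canonicalize (const char *in, char *out, size_t cap)
{
  size_t n = 0;

  if (0 == cap)
    return 0;
  for (; '\0' != *in && n + 1 < cap; in++)
  {
    unsigned char c = (unsigned char) *in;

    if (! isalnum (c))
      continue;                 /* drop special characters */
    c = (unsigned char) tolower (c);
    if (c == 'a' || c == 'e' || c == 'i' || c == 'o' || c == 'u')
      continue;                 /* drop vowels */
    out[n++] = (char) c;
  }
  out[n] = '\0';
  return n;
}
```

Under these assumptions "Normalization" becomes "nrmlztn", and a password-like keyword such as "Pa$$w0rd!" shrinks to "pw0rd", which illustrates why the note insists that normalization must stay optional.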

LRN 2011-12-17 01:21 developer ~0005118	Well, this is what i'm thinking:
* If the query was submitted with Shift+click-on-Find or Shift+Enter (TODO: do actually submit queries on Enter...), then check the query normalization setting
* If automatic normalization is enabled, Shift+Enter avoids it
* If automatic normalization is disabled, Shift+Enter triggers the normalization
* Do normalization unconditionally, but don't substitute the results, unless the conditions above are met (either normalize-enabled and Enter, or normalize-disabled and Shift+Enter).
* If normalized and original versions differ (no matter what normalization settings are, and how the query was submitted), then check the warning message setting
* If normalization warning is enabled, show it. Normalization warning tells the user about keyword normalization, shows the status of the normalization settings, and tells about Shift+Enter and Shift+Click. There will be a checkbox on the message, that allows the user to prevent it from showing up again (disable it).
The only difficult bit here is the warning message, since we can't use gtk_dialog_run (), or anything else that blocks. All other things from the proposition above just require code and attention, and some agreed-upon standard for saving GTK settings (GSettings?).
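
The rules proposed in this note boil down to two independent decisions, which the following sketch tries to make explicit. The struct and function names are hypothetical, and the code covers only the decision logic, not the GTK event handling or the non-blocking warning dialog.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical result type for the proposal in note 0005118. */
struct query_decision
{
  bool use_normalized;  /* submit the normalized query instead of the original */
  bool show_warning;    /* tell the user that normalization changed the query */
};

/* `normalized` is computed unconditionally by the caller.
   Shift inverts the configured behaviour: with automatic normalization
   enabled it suppresses substitution, with it disabled it triggers it. */
struct query_decision
decide_query (const char *original, const char *normalized,
              bool auto_normalize, bool shift_held, bool warning_enabled)
{
  struct query_decision d;
  bool differs = (0 != strcmp (original, normalized));

  d.use_normalized = differs && (auto_normalize != shift_held);
  d.show_warning = differs && warning_enabled;
  return d;
}
```

Written this way, the substitution rule is an exclusive-or of the setting and the Shift modifier, while the warning depends only on whether the two forms differ and on the warning setting.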

Christian Grothoff 2011-12-17 01:27 manager ~0005119	Ugh, that's pretty complicated. But I like the idea of doing something with shift-enter. What if we simply normalized the keyword in the field if the user presses shift-enter? That way, I can search with normalization using shift-enter+enter and without normalization using just enter. Isn't that easy enough to use and -- once the user is aware, also easy to understand? I'm very much against having tons of additional options (normalize/warn/etc.). Next question: what do we do for the publishing side?

LRN 2011-12-17 02:14 developer ~0005121	Well, the warning message could be replaced by a tip-of-the-day (i think i've mentioned this before - it's a great way to tell the user about this feature or that, without making the user read the docs), in which case everything becomes much simpler. As for publishing, it's the same principle, only in reverse: by default, add normalized keywords if they differ from the originals (no sense in replacing them - it's good to have more keywords). This is overridden by Shift+click and Shift+Enter. Or maybe have a separate keyword list for normalized keywords (which the user can't edit, and which is updated automatically as the normal keywords list changes), and a checkbox "Automatically add normalized keywords when publishing" (which is enabled by default).

LRN 2011-12-18 08:56 developer ~0005125	Right now you have a "normalize" button in the namespace publishing dialog (but not in the similar normal publishing dialog; duplication of code between the normal publishing and namespace publishing dialogs is yet another matter). It normalizes the selected keyword. This is unacceptable.
Typically we'd want users to make publications with MORE keywords. If sane keywords could be produced automatically - they SHOULD be. That includes normalization. Which means that in 90% of cases we should make the publication with user-specified keywords, extracted keywords AND normalized versions of all keywords (extracted or user-specified). Since this is the typical case, it should require the least amount of user input. Which means that everything should work by itself, there should be no need to press any buttons.
Special cases:
1) User wants to ONLY publish under a non-normalized keyword.
2) User wants to ONLY publish under a normalized keyword (and user is too lazy to try to figure out how to input it already normalized, (s)he just types it in as-is, and lets GNUnet do the job).
Both (1) and (2) special cases are per-keyword.
How to make (1): Select "Use only selected normalized keywords" checkbox. Normalized keywords won't be added automatically. User can select (a listbox with multi-selection), which normalized keywords will be added.
How to make (2): Select "Use only selected keywords" checkbox. Only selected (in listbox with multi-select) keywords will be used.
Normal keywords are displayed on one list, normalized ones - in another. Normal keywords list is pre-populated with all extracted keywords. List of normalized keywords is kept in sync with the list of normal keywords, and is not directly editable. Obviously, duplicates will be removed just before keywords are actually used.

Christian Grothoff 2011-12-18 11:44 manager ~0005126	Publishing is more complicated than you make it out to be here. The typical process I'm thinking about involves the user picking a directory to be published. Then, LE goes and extracts keywords. At that point, we can (should?) normalize those keywords; some are lifted up to the parent directory. Then the user can manually add more keywords. Again, those keywords we might normalize. I like the idea of not displaying the 'normalized' keywords in the normal keyword list. I'm not sure having a second list for the normalized keywords is a good idea. Maybe just have a check box 'also publish normalized keywords' and never even show them? I don't think there is a point in "only using normalized keywords", if I publish the normalized keywords I don't see a good case for withholding the non-normalized keywords. Removal of duplicates is already part of the ksk URI functionality, so that's not only obvious but enforced by the API already.

Christian Grothoff 2011-12-18 11:45 manager ~0005127	Also, we should pick the safest thing by default, so by default the 'publish normalized keywords' would have to be unchecked.

LRN 2011-12-18 23:58 developer ~0005129	> we should pick the safest thing by default, so by default the 'publish normalized keywords' would have to be unchecked.
Did you mean "checked"? If you didn't, then i'm not sure why.

Christian Grothoff 2011-12-19 13:50 manager ~0005133	By default, we should not normalize keywords as the keywords might actually be passwords where normalization would weaken them. However, we should urge users to turn it on by tooltip and/or tip-of-the-day.
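
The publishing behaviour discussed in the notes above (an 'also publish normalized keywords' option that is off by default, with duplicates removed) could look roughly like the sketch below. All names are hypothetical, the normalizer is a lower-casing stand-in rather than the real canonicalization, and in GNUnet itself duplicate removal is already handled by the ksk URI functionality, as note 0005126 points out.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_KW_LEN 64

/* Stand-in normalizer (assumption): lower-cases ASCII only. */
static void
lowercase_keyword (const char *in, char *out, size_t cap)
{
  size_t n = 0;

  for (; '\0' != in[n] && n + 1 < cap; n++)
    out[n] = (char) tolower ((unsigned char) in[n]);
  out[n] = '\0';
}

/* Append `kw` unless it is already present, too long, or the set is full;
   returns the new element count. */
static size_t
add_unique (char set[][MAX_KW_LEN], size_t count, size_t max, const char *kw)
{
  for (size_t i = 0; i < count; i++)
    if (0 == strcmp (set[i], kw))
      return count;
  if (count == max || strlen (kw) >= MAX_KW_LEN)
    return count;
  strcpy (set[count], kw);
  return count + 1;
}

/* Always keeps the original keywords; adds normalized variants only when
   `also_normalized` is set (the caller's default should be false). */
size_t
build_publish_keywords (const char *const *user_kw, size_t n_user,
                        bool also_normalized,
                        char out[][MAX_KW_LEN], size_t max_out)
{
  size_t count = 0;
  char norm[MAX_KW_LEN];

  for (size_t i = 0; i < n_user; i++)
  {
    count = add_unique (out, count, max_out, user_kw[i]);
    if (! also_normalized)
      continue;
    lowercase_keyword (user_kw[i], norm, sizeof norm);
    count = add_unique (out, count, max_out, norm);
  }
  return count;
}
```

With the option off, a password-like keyword is published exactly as typed and nothing weaker is added next to it, which is the safety argument made in the note.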

LRN 2011-12-22 16:55 developer ~0005163	(2011-12-21 18:22:44) LRN: there's something wrong with keyword normalizer
(2011-12-21 18:22:55) LRN: It doesn't normalize "A" to "a"
(2011-12-21 18:24:00) grothoff-office: There is a limit, the result must be longer than 3 chars.
(2011-12-21 18:24:17) grothoff-office: And vowels are discarded...
(2011-12-21 18:25:04) LRN: "AAAAAAAAAAAA" is not normalized to "aaaaaaaaaaaa"
(2011-12-21 18:27:28) grothoff-office: vowels are discarded...
(2011-12-21 18:27:40) LRN: why?
(2011-12-21 18:28:00) grothoff-office: That was part of the original suggestion for normalization.
(2011-12-21 18:28:05) LRN: ...
(2011-12-21 18:28:13) grothoff-office: Or should I say, nrmlztn
(2011-12-21 18:28:22) LRN: Doesn't make any sense to me
(2011-12-21 18:29:21) grothoff-office: I think someone went as far as suggesting to remove vowels and then sort the remaining letters alphabetically (in addition to converting to lower-case and removing special chars).
(2011-12-21 18:30:03) LRN: that STILL won't get you approximate searches - so why even try?
(2011-12-21 18:30:04) grothoff-office: I think it is all just a question of how "canonical" you need to be (and how much mis-spelling you can start to tolerate before everything becomes one word ;-))
(2011-12-21 18:30:25) grothoff-office: It's a trade-off. I have no empirical data to justify or falsify either position.
(2011-12-21 18:35:24) LRN: You can take a dictionary, normalize every word, and then search for duplicates
(2011-12-21 18:37:45) LRN: also, it replaces non-latin letters with '_'
(2011-12-21 18:38:12) LRN: And here i thought that we could FINALLY have UTF-8 conformant search...Dream on!
(2011-12-21 18:38:18) grothoff-office: dictionary says little about how often actual words are in real searches.
(2011-12-21 18:38:49) grothoff-office: Now, the normal (non-canonicalized) GNUnet search should be fine with UTF-8.
(2011-12-21 18:39:07) LRN: true
(2011-12-21 18:39:20) LRN: But that means that normalization is non-existent for anything non-latin
(2011-12-21 18:39:34) LRN: Which makes the value of normalization somewhat questionable
(2011-12-21 18:39:43) LRN: you could at least try transliteration...
(2011-12-21 18:39:56) LRN: instead of just discarding characters
(2011-12-21 18:40:07) LRN: s/discarding/turning into _/
(2011-12-21 18:41:30) LRN: also, GNUNET_FS_uri_ksk_canonicalize() ADDS normalized version to the uri, not REPLACES the original
(2011-12-21 18:41:53) LRN: While this MIGHT be the desired behaviour in some cases, it is NOT desired in other cases
(2011-12-21 18:42:34) LRN: and it's difficult to work around (i could get back normalized string, then cut out first strlen(original) characters)
(2011-12-21 18:43:20) grothoff-office: As usual, I'm open for patches ;-).
(2011-12-21 18:43:37) grothoff-office: I never liked the idea of canonicalization much in the first place...
(2011-12-21 18:51:17) LRN: i imagined that normalization ONLY removes special chars, converts to lowercase, and converts accented chars (and other chars that are modified versions of normal ones) to normal chars.
(2011-12-21 18:52:12) LRN: With that you can publish under keyword X, normalizing it to X1, and someone MIGHT search for X1, because probability of typing in X1 is relatively high
(2011-12-21 18:52:42) LRN: And there's a probability of publishing under X1 in the first place
(2011-12-21 18:53:38) LRN: With the kind of normalization that goes on right now, there is ZERO probability that someone publishes or searches with normalized keyword UNLESS they are using normalization. Because no human being would type "nrmlztn" instead of "normalization"
(2011-12-21 18:55:21) LRN: This makes normal-normal search-publishing more likely to succeed, while normal-common and common-normal search-publishings are less likely to succeed
(2011-12-21 18:55:57) LRN: Since normalization is NOT mandatory, that seems like a bad idea to me.
(2011-12-21 18:57:11) LRN: So, my suggestion: 1) Don't remove vowels. 2) Use transliteration for non-latin characters.
(2011-12-21 18:57:35) LRN: (2) requires full unicode support (right now canonicalize_keyword() is strictly ASCII)
(2011-12-21 18:58:55) LRN: I'm itching to re-implement normalizer the way i like (with UTF-8 support too!) in gnunet-fs-gtk. I know it's wrong, but it's itching!
(2011-12-21 19:21:47) LRN: do you have any gnuish utf-8 string manipulation libraries?
(2011-12-21 19:22:01) LRN: I don't feel like tying gnunet-core to glib...
(2011-12-21 19:26:17) LRN: how about libunistring?
(2011-12-21 19:30:41) grothoff-office: I'm not sure this is worth any additional dependencies...
(2011-12-21 19:31:32) LRN: I'm not sure it is worth re-inventing the wheel by writing an UTF-8-aware normalizer
(2011-12-21 19:39:30) LRN: H-m-m-...converting utf-8 -> UCS-4 (iconv can do that) and then working ucs4char-by-ucs4char might be trivial enough to implement on my own
(2011-12-21 20:32:47) LRN: iconv is somewhat optional, as HAVE_ICONV might be undefined, but then we can just fall back to some kind of simplistic behaviour.
(2011-12-21 20:33:35) LRN: Well, either all that, or making normalization THE default
(2011-12-21 20:33:46) LRN: (both when publishing and searching)
(2011-12-21 20:34:39) LRN: (although even in that case unicode vowels will have to be also removed)
(2011-12-21 20:34:52) LRN: (which means some kind of unicode support)
(2011-12-21 23:39:38) LRN: that SUX
(2011-12-21 23:39:45) LRN: Why, oh WHY?
(2011-12-21 23:40:29) LRN: libiconv has been developed for YEARS now. And it STILL doesn't have a transliteration table for cyrillic range (0x401-0x491). WTF!?!?
(2011-12-22 19:34:19) LRN: grothoff, i spoke with libiconv devs, and they've said that libiconv only implements universal transliteration (mappings that everyone have agreed upon). No such mapping exists for some scripts (mappings are locale-dependent). Sadly for me, cyrillic is one of them. So the idea to use transliteration as part of string normalization is not going to work without an extra dependency.
(2011-12-22 19:34:30) LRN: And i'm beginning to doubt its efficiency anyway
(2011-12-22 19:36:37) LRN: If transliteration is made locale-dependent, then publisher and downloader will have to use the same locale, otherwise keywords are unlikely to match. If it's made locale-independent, then there's a high probability that some people will never type the keyword in the way it was specified, because it is typed differently for their locale
(2011-12-22 19:49:39) LRN: And again: if you make normalization the default, it won't matter how bad it is (well, as long as it does some kind of crude locale-independent transliteration as well), it will work reasonably well.
(2011-12-22 19:50:32) LRN: Right now i'm in favour of just disabling normalization completely and making do without it. In its present state it does nothing good.
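
The dictionary experiment suggested in the log (normalize every word, then look for duplicates) can be tried on a small scale with something like the following. This is only a sketch: the function names are hypothetical, the normalizer is an ASCII-only approximation of the lower-case, no-vowels, no-special-characters rule, and the quadratic comparison is meant for short word lists, not a full dictionary.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define KW_MAX 64

/* Approximate normalizer (assumption): lower-case, keep only ASCII
   letters and digits, drop vowels. */
static void
strip_vowels_lower (const char *in, char *out, size_t cap)
{
  size_t n = 0;

  for (; '\0' != *in && n + 1 < cap; in++)
  {
    int c = tolower ((unsigned char) *in);

    if (! isalnum (c) || NULL != strchr ("aeiou", c))
      continue;
    out[n++] = (char) c;
  }
  out[n] = '\0';
}

/* Counts pairs of distinct input words whose normalized forms are equal. */
size_t
count_collisions (const char *const *words, size_t n)
{
  size_t pairs = 0;
  char a[KW_MAX];
  char b[KW_MAX];

  for (size_t i = 0; i < n; i++)
  {
    strip_vowels_lower (words[i], a, sizeof a);
    for (size_t j = i + 1; j < n; j++)
    {
      strip_vowels_lower (words[j], b, sizeof b);
      if (0 == strcmp (a, b))
        pairs++;
    }
  }
  return pairs;
}
```

For example, "bat", "bit" and "boot" all normalize to "bt" under this rule. As the log notes, a dictionary count alone says little about how often such words show up in real searches.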

Christian Grothoff 2011-12-23 10:55 manager ~0005169	... so, shall we just remove all of the code related to normalization?

LRN 2011-12-23 10:57 developer ~0005170	Code does no harm when it can't be used (and it can't, because related UI widgets are hidden), so i'd vote to keep it. If you ever enable normalization by default, it'll be useful.

Christian Grothoff 2011-12-24 22:51 manager ~0005200	Actually, canonicalization is already used (see fs_file_information.c:373/751), so it is not "just" the GUI button we're talking about here. Also, if this idea is simply flawed to begin with, we should not try to maintain code for its potential inclusion.

Christian Grothoff 2011-12-25 22:16 manager ~0005205	"fixed" by elimination of all canonicalization/normalization code in SVN 18820/18821.

Date Modified	Username	Field	Change
2011-11-22 20:25	Christian Grothoff	New Issue
2011-11-22 20:38	Christian Grothoff	Priority	normal => low
2011-11-22 20:38	Christian Grothoff	Status	new => confirmed
2011-11-26 18:22	Christian Grothoff	Relationship added	child of 0001966
2011-12-16 08:36	LRN	Note Added: 0005104
2011-12-16 20:14	Christian Grothoff	Note Added: 0005112
2011-12-17 01:21	LRN	Note Added: 0005118
2011-12-17 01:27	Christian Grothoff	Note Added: 0005119
2011-12-17 02:14	LRN	Note Added: 0005121
2011-12-18 08:56	LRN	Note Added: 0005125
2011-12-18 11:44	Christian Grothoff	Note Added: 0005126
2011-12-18 11:45	Christian Grothoff	Note Added: 0005127
2011-12-18 23:58	LRN	Note Added: 0005129
2011-12-19 13:50	Christian Grothoff	Note Added: 0005133
2011-12-19 13:54	Christian Grothoff	Assigned To	=> LRN
2011-12-19 14:21	Christian Grothoff	Target Version	=> 0.9.1
2011-12-22 16:55	LRN	Note Added: 0005163
2011-12-23 10:55	Christian Grothoff	Note Added: 0005169
2011-12-23 10:57	LRN	Note Added: 0005170
2011-12-24 22:51	Christian Grothoff	Note Added: 0005200
2011-12-25 22:16	Christian Grothoff	Note Added: 0005205
2011-12-25 22:17	Christian Grothoff	Status	confirmed => resolved
2011-12-25 22:17	Christian Grothoff	Fixed in Version	=> 0.9.1
2011-12-25 22:17	Christian Grothoff	Resolution	open => won't fix
2011-12-26 22:30	Christian Grothoff	Status	resolved => closed
