0002048: Non-standard-compliant behaviour is desired of id3 extractor? - MantisBT

ID	Project	Category	View Status	Date Submitted	Last Update

0002048	libextractor	plugins	public	2011-12-29 18:34	2012-09-25 17:18

Reporter	LRN	Assigned To	Christian Grothoff
Priority	normal	Severity	major	Reproducibility	N/A
Status	closed	Resolution	suspended
Product Version	0.6.3
Target Version	1.0.0	Fixed in Version	1.0.0

Summary	0002048: Non-standard-compliant behaviour is desired of id3 extractor?
Description	id3 has been fucked up for a long time. It initially mandated a very narrow set of string encodings, namely ISO-8859-1 and UCS-2. Although UCS-2 is pretty decent, it never saw much use, and ISO-8859-1 can't even hold a candle to it. Software developers tended to use either UTF-8 (which was later codified as a standard in ID3v2.4) or local encodings (which was a very widespread practice, and might still be). As a result, mp3 files (which rely exclusively on id3) often lie about their encoding and use a locale-dependent encoding, CP1251 for example. The plugin, however, sticks to the standard and tries to use ISO-8859-1, and you can guess the results.

The questions are: 1) What to do about it? 2) Who is responsible for doing something about it?

One of the answers is to tell users: "OK, listen, if you want GNUnet to automatically figure out metadata for your files, then keep them neat and standard-compliant". In that case libextractor won't need patching, but the files will, which will obviously be the user's responsibility. And it's more difficult than it sounds, especially if you restrict yourself to using free software tools on W32.

Another answer is to try to look for non-standard encodings (how?) and use the proper conversion, instead of following the standard to the letter. I have no answers for the "how to detect non-standard encodings" question. Or this behaviour might be made optional, in which case the user will adjust the preferences if non-compliant behaviour is needed. The problem is, I don't remember seeing any kind of "options" or "ini files" for libextractor, and no API to tweak its behaviour.

Yet another strategy is NOT to decode strings in LE, but to pass them to the client (possibly with a hint about the encoding LE THINKS should have been used), and let the client do the decoding (with or without the user's help; the client does have a way to communicate with the user, GNUnet-fs-gtk certainly can do that).
Obviously, that will take a lot of patching.
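
For readers unfamiliar with the failure mode described above, here is a small illustration in Python (not libextractor code). It assumes a tag whose text is really CP1251 but is labelled ISO-8859-1; the sample title is made up.

```python
# Illustration only: a Cyrillic title stored as CP1251 bytes but
# labelled (via the ID3 text-encoding byte) as ISO-8859-1.
raw = "Кино".encode("cp1251")

# What a reader that follows the label to the letter produces.
garbled = raw.decode("iso-8859-1")

# What the author of the tag intended.
intended = raw.decode("cp1251")

# ISO-8859-1 maps every byte to a code point, so the wrong decode is
# lossless and can be undone once the real encoding is known.
recovered = garbled.encode("iso-8859-1").decode("cp1251")

print(repr(garbled), intended, recovered)
```

The last step matters for the "let the client decode" strategy: a client that only receives the mis-decoded string can still recover the original bytes, provided it knows ISO-8859-1 was the encoding that was wrongly applied.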
Tags	No tags attached.

Christian Grothoff 2012-01-03 07:55 manager ~0005240	Maybe provide a way to override ID3-tag charset via environment variable? I just see no good way for us to "guess" the encoding, and what might work better for you now might be much worse for others later. Override via environment variable (plus documentation -- ok, LE docs are a bit lacking to begin with) would seem like the best solution as each user can set it to fit his collection. Only issue is obviously that the user would need to do 'something', but I don't see a way to do this automatically. If you do, let me know...
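
A rough sketch of the override idea, in Python rather than the plugin's C. The variable name `ID3_CHARSET_OVERRIDE` and the function `decode_latin1_field` are hypothetical; nothing here is an existing libextractor interface.

```python
import os

# Hypothetical variable name; the comment above does not fix one.
OVERRIDE_VAR = "ID3_CHARSET_OVERRIDE"


def decode_latin1_field(raw: bytes, environ=os.environ) -> str:
    """Decode a field the tag labels as ISO-8859-1, honouring a user override."""
    charset = environ.get(OVERRIDE_VAR, "iso-8859-1")
    try:
        return raw.decode(charset)
    except (LookupError, UnicodeDecodeError):
        # Unknown charset name, or bytes that are invalid in it:
        # fall back to what the standard prescribes.
        return raw.decode("iso-8859-1")
```

With the variable unset, behaviour stays standard-compliant. A user with a CP1251 collection would set it once, for example `ID3_CHARSET_OVERRIDE=cp1251`, before running the extractor.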

LRN 2012-01-03 12:07 reporter ~0005241	Oh-kay, something like this: Every time, before a file is scanned, get a global environment variable [1] "GNUNET__LANGS", where "" is the name of the plugin. Parse it and extract a list of language:encoding pairs. If the length of the list is > 0, then every time a string is encountered that uses a single-byte encoding (unsuitable for non-Latin locales), LE tries to convert the filename from every encoding in the list to UTF-8 (or UCS-4, doesn't really matter), then evaluates every result, looking for the string that produces the best match for the given language from the pair, and uses that result.

How to do the matching is language-dependent, I guess. There are ways to detect, with good probability, that a conversion is garbled, if you know the language you're supposed to get a string in. For CP1251 you can check that the number of normal (lower- and upper-case, not "funny" in any way) characters is significant, and that you don't get more than 4 "funny" characters in a row. Also, if you get a string that looks like "P^P#P&P P...", then it was originally in UTF-8. But again, it's very language-dependent, which is why the language is also specified in the pair, not just the encoding.

[1] In reality, on W32 a global variable lives in the registry and will have to be accessed directly; getenv() will only pull variables from the process-specific envtable. Not sure about other OSes, but something can be done there as well.

OR I might put some effort into improving existing tag-correcting pieces of free software and tell users to use those. OR I might write one myself. OR I might add a context menu for metadata entries in the publication editing window, where you can "fix" the encoding by choosing one of several re-encoding results that looks right to you. But that will only work for a single file; you'd have to go through every file and fix things by hand, and that's tedious. Maybe someone did research on this?
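
The matching step proposed above might look roughly like the following Python sketch. It is only an illustration of the two checks mentioned for CP1251 (share of "normal" letters, and no more than 4 "funny" characters in a row). The names `NORMAL`, `score`, and `best_decoding` are made up, and only Russian is covered.

```python
import re

# "Normal" letters per language; only Russian is sketched here.
NORMAL = {"ru": re.compile(r"[А-Яа-яЁё]")}


def score(text: str, lang: str) -> float:
    """Share of expected letters, or 0.0 if the text looks garbled."""
    normal = NORMAL.get(lang)
    if normal is None:
        return 0.0
    run = hits = considered = 0
    for ch in text:
        if ch.isspace() or ch.isdigit():
            run = 0  # whitespace and digits are neutral and break a run
            continue
        considered += 1
        if normal.match(ch):
            hits += 1
            run = 0
        else:
            run += 1
            if run > 4:  # more than 4 "funny" characters in a row
                return 0.0
    return hits / considered if considered else 0.0


def best_decoding(raw: bytes, pairs):
    """pairs: iterable of (language, encoding). Returns the best text or None."""
    best, best_score = None, 0.0
    for lang, enc in pairs:
        try:
            text = raw.decode(enc)
        except (LookupError, UnicodeDecodeError):
            continue  # unknown encoding or bytes invalid in it
        s = score(text, lang)
        if s > best_score:
            best, best_score = text, s
    return best
```

A letter-share score like this cannot tell CP1251 from KOI8-R, because both place Cyrillic letters in the 0xC0-0xFF byte range; separating them would need per-language data such as letter frequencies.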

Christian Grothoff 2012-01-03 13:25 manager ~0005242	Well, my idea was that the environment variable should be "EXTRACTOR__CHARSET", not "GNUNET", as this is about LE. Now, your idea is different in that you try to see if the result is proper in a particular language. That would require us to provide some data set for each language; I'm not sure that's viable. I agree that re-encoding in the GUI is likely too tedious as well. However, a GUI method of changing the environment variable might work: a menu where you first select the LE plugin that supports/needs this and then the encoding, and as a result the encoding for the respective plugin is changed. If the selection is persistent (stored in a configuration file), users might play around until the plugin for their content has the right setting. So instead of correcting ONE item by hand, offer a means to correct ALL items forever? (And without a heuristic like the language-detection stuff, which could fail?)

Christian Grothoff 2012-08-24 23:23 manager ~0006283	Given that ID3 tags are now done by gstreamer, I think the problem doesn't really lie within our domain anymore.

Date Modified	Username	Field	Change
2011-12-29 18:34	LRN	New Issue
2012-01-03 07:55	Christian Grothoff	Note Added: 0005240
2012-01-03 12:07	LRN	Note Added: 0005241
2012-01-03 13:25	Christian Grothoff	Note Added: 0005242
2012-01-05 22:08	Christian Grothoff	Status	new => feedback
2012-08-24 23:23	Christian Grothoff	Note Added: 0006283
2012-08-24 23:24	Christian Grothoff	Status	feedback => resolved
2012-08-24 23:24	Christian Grothoff	Fixed in Version	=> Git master
2012-08-24 23:24	Christian Grothoff	Resolution	open => suspended
2012-08-24 23:24	Christian Grothoff	Assigned To	=> Christian Grothoff
2012-09-09 02:34	Christian Grothoff	Fixed in Version	Git master => 1.0.0
2012-09-09 02:34	Christian Grothoff	Target Version	=> 1.0.0
2012-09-25 17:18	Christian Grothoff	Status	resolved => closed