View Issue Details

ID: 0002048
Project: libextractor
Category: plugins
View Status: public
Last Update: 2012-09-25 17:18
Reporter: LRN
Assigned To: Christian Grothoff
Priority: normal
Severity: major
Reproducibility: N/A
Status: closed
Resolution: suspended
Product Version: 0.6.3
Target Version: 1.0.0
Fixed in Version: 1.0.0
Summary: 0002048: Non-standard-compliant behaviour is desired of id3 extractor?
Description: ID3 has been fucked up for a long time. It initially mandated a very narrow set of encodings for strings, namely ISO-8859-1 and UCS-2. Although UCS-2 is pretty decent (ISO-8859-1 can't even hold a candle to it), it never saw much use. Software developers tended to use either UTF-8 (which was only later codified in ID3v2.4) or local encodings (which was a very widespread practice, and might still be).
As a result, mp3 files (which rely exclusively on ID3) often lie about their encoding and actually use a locale-dependent one, CP1251 for example. The plugin, however, sticks to the standard and decodes such strings as ISO-8859-1, and you can guess the results.
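
To make "you can guess the results" concrete, here is a minimal Python sketch of the mismatch. The sample title is an assumption picked for illustration; the snippet only demonstrates the decoding problem and is not libextractor code.

```python
# Assumed sample tag value; any Cyrillic text shows the same effect.
title = "Привет"

# What a tagger using the local Windows code page writes into the frame.
raw = title.encode("cp1251")

# What a strictly standard-following reader does with a frame marked ISO-8859-1.
as_standard = raw.decode("iso-8859-1")

# What the author of the file actually meant.
as_intended = raw.decode("cp1251")

print(repr(as_standard))
print(repr(as_intended))
```

Both decodes succeed without an error, because every byte is valid in ISO-8859-1. That is exactly why the extractor cannot notice the problem on its own.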

The questions are:
1) What to do about it?
2) Who is responsible for doing something about it?

One of the answers is to tell users: "OK, listen, if you want GNUnet to automatically figure out metadata for your files, then keep them neat and standard-compliant". In that case libextractor won't need patching, but the files will, which is obviously the user's responsibility. And it's more difficult than it sounds, especially if you restrict yourself to free software tools on W32.

Another answer is to try to look for non-standard encodings (how?) and use the proper conversion, instead of following the standard to the letter. I have no answers for the "how to detect non-standard encodings" question.

Or this behaviour might be made optional, in which case the user adjusts the preferences if non-compliant behaviour is needed. The problem is, I don't remember seeing any kind of "options" or "ini files" for libextractor, nor any API to tweak its behaviour.

Yet another strategy is NOT to decode strings in LE, but to pass them to the client (possibly with a hint about the encoding LE THINKS should have been used), and let the client do the decoding (with or without the user's help; the client does have a way to communicate with the user, and GNUnet-fs-gtk certainly can do that). Obviously, that will take a lot of patching.
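
A rough sketch of what "pass the bytes plus a hint" could look like, written in Python for brevity. `RawMetadata` and `client_decode` are hypothetical names invented for this illustration; nothing like them exists in the libextractor API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RawMetadata:
    """Hypothetical container: undecoded tag bytes plus the extractor's guess."""
    key: str
    raw: bytes
    encoding_hint: str  # what the tag header claims, e.g. "iso-8859-1"


def client_decode(item: RawMetadata, override: Optional[str] = None) -> str:
    """Decode on the client side; a charset chosen by the user wins over the hint."""
    charset = override or item.encoding_hint
    return item.raw.decode(charset, errors="replace")


item = RawMetadata("title", "Привет".encode("cp1251"), "iso-8859-1")
print(client_decode(item))            # follows the hint
print(client_decode(item, "cp1251"))  # user picked the real charset
```

The extractor stays standard-compliant here, and the decision about non-standard encodings moves to the one place that can ask the user.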
Tags: No tags attached.

Activities

Christian Grothoff

2012-01-03 07:55

manager   ~0005240

Maybe provide a way to override ID3-tag charset via environment variable? I just see no good way for us to "guess" the encoding, and what might work better for you now might be much worse for others later. Override via environment variable (plus documentation -- ok, LE docs are a bit lacking to begin with) would seem like the best solution as each user can set it to fit his collection.

Only issue is obviously that the user would need to do 'something', but I don't see a way to do this automatically. If you do, let me know...

LRN

2012-01-03 12:07

reporter   ~0005241

Oh-kay, something like this:
Every time, before a file is scanned, get a global environment variable [1] "GNUNET_*_LANGS", where "*" is the name of the plugin.
Parse it, extract a list of language:encoding pairs.
If the length of the list is > 0, then every time a string is encountered that uses a single-byte encoding (unsuitable for non-Latin locales), LE tries to convert the string from every encoding in the list to UTF-8 (or UCS-4, it doesn't really matter), then evaluates every result, looking for the one that best matches the language from its pair, and uses that result.
How to do the matching is language-dependent, I guess. If you know the language a string is supposed to be in, there are ways to detect, with good probability, that a conversion is garbled. For CP1251 you can check that the number of normal characters (lower- and upper-case letters, not "funny" in any way) is significant, and that you don't get more than 4 "funny" characters in a row. Also, if you get a string that looks like "P^P#P&P P...", then it was originally in UTF-8. But again, it's very language-dependent, which is why the pair specifies the language and not just the encoding.
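
The procedure above could be sketched roughly like this in Python. Everything specific is an assumption made for the sketch: the comma-separated "lang:encoding" format of the variable, the 0.8 threshold for "significant", and the hand-made table of "normal" characters for Russian. Note that a check this simple rejects garbage, but it cannot tell two Cyrillic code pages (say CP1251 and KOI8-R) apart, since both decode to ordinary Cyrillic letters.

```python
from typing import List, Optional, Tuple

# Assumed per-language tables of "normal" characters (only Russian is filled in).
CYRILLIC = {chr(c) for c in range(0x0410, 0x0450)} | {"Ё", "ё"}
PLAIN_ASCII = set(
    "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789 .,-!?'()"
)
NORMAL = {"ru": CYRILLIC | PLAIN_ASCII}


def parse_pairs(value: str) -> List[Tuple[str, str]]:
    """Parse an assumed 'ru:cp1251,ru:utf-8' format into (language, encoding) pairs."""
    pairs = []
    for item in value.split(","):
        lang, sep, enc = item.strip().partition(":")
        if sep and lang and enc:
            pairs.append((lang, enc))
    return pairs


def score(text: str, lang: str) -> float:
    """Fraction of normal characters; 0.0 if more than 4 funny ones occur in a row."""
    normal = NORMAL.get(lang, PLAIN_ASCII)
    if not text:
        return 0.0
    good = run = 0
    for ch in text:
        if ch in normal:
            good += 1
            run = 0
        else:
            run += 1
            if run > 4:
                return 0.0
    return good / len(text)


def best_decoding(raw: bytes, pairs: List[Tuple[str, str]],
                  threshold: float = 0.8) -> Optional[str]:
    """Try every pair and return the best-scoring text, or None if nothing looks sane."""
    best_text, best_score = None, 0.0
    for lang, enc in pairs:
        try:
            text = raw.decode(enc)
        except (UnicodeDecodeError, LookupError):
            continue  # bytes invalid in this encoding, or unknown encoding name
        current = score(text, lang)
        if current > best_score:
            best_text, best_score = text, current
    return best_text if best_score >= threshold else None


pairs = parse_pairs("ru:utf-8, ru:cp1251")
print(best_decoding("Привет, мир".encode("cp1251"), pairs))
```

Putting UTF-8 first in the list is a cheap way to handle the "it was originally in UTF-8" case: CP1251 text is almost never valid UTF-8, so the strict decode fails and the loop moves on to the next pair.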

[1] In reality, on W32 a global variable lives in the registry and will have to be accessed directly; getenv() will only pull variables from the process-specific environment table. Not sure about other OSes, but something can be done there as well.

OR I might put some effort into improving existing tag-correcting pieces of free software and tell users to use that.

OR I might write one myself.

OR I might add a context menu for metadata entries in the publication editing window, where you can "fix" the encoding by choosing whichever of several re-encoding results looks right to you. But that will only work for a single file; you'd have to go through every file and fix things by hand, which is tedious.
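
For the context-menu idea, the list of re-encoding results could be produced roughly as below (Python sketch). The function name and the default candidate list are assumptions for illustration; the trick is simply to undo the ISO-8859-1 decoding and try the other charsets on the recovered bytes.

```python
from typing import Dict, Sequence


def reencoding_choices(
    displayed: str,
    assumed: str = "iso-8859-1",
    candidates: Sequence[str] = ("cp1251", "koi8-r", "utf-8"),
) -> Dict[str, str]:
    """Undo the assumed decoding and show how the bytes read in each candidate charset."""
    try:
        raw = displayed.encode(assumed)
    except UnicodeEncodeError:
        return {}  # the text was not produced by the assumed decoding
    choices = {}
    for enc in candidates:
        try:
            choices[enc] = raw.decode(enc)
        except UnicodeDecodeError:
            continue  # these bytes are not valid in this charset, so do not offer it
    return choices


for charset, text in reencoding_choices("Ïðèâåò").items():
    print(charset, text)
```

The user then picks the entry that looks right; candidates whose strict decode fails are not offered at all.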

Maybe someone did research on this?

Christian Grothoff

2012-01-03 13:25

manager   ~0005242

Well, my idea was that the environment variable should be "EXTRACTOR_*_CHARSET", not "GNUNET", as this is about LE. Now, your idea is different in that you try to see whether the result is proper in a particular language. That would require us to provide some data set for each language; I'm not sure that's viable. I agree that re-encoding in the GUI is likely too tedious as well. However, a GUI method of changing the environment variable might work: a menu where you first select the LE plugin that supports/needs this and then the encoding, and as a result the encoding for the respective plugin is changed. If the selection is persistent (stored in a configuration file), users might play around until the plugin for their content has the right setting. So instead of correcting ONE item by hand, offer a means to correct *all* items forever (and without a heuristic like the language-detection stuff, which could fail)?
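
A minimal sketch of the per-plugin override, in Python rather than LE's C. The variable pattern "EXTRACTOR_*_CHARSET" is taken from the note above; expanding it to EXTRACTOR_ID3_CHARSET, falling back to ISO-8859-1, and ignoring unknown charset names are all assumptions of this sketch, not existing libextractor behaviour.

```python
import codecs
import os
from typing import Mapping

DEFAULT_CHARSET = "iso-8859-1"  # what the standard-following plugin uses today


def charset_for_plugin(plugin: str, environ: Mapping[str, str] = os.environ) -> str:
    """Return the charset from EXTRACTOR_<PLUGIN>_CHARSET, or the default if unset/unknown."""
    name = f"EXTRACTOR_{plugin.upper()}_CHARSET"
    value = environ.get(name, "").strip()
    if not value:
        return DEFAULT_CHARSET
    try:
        codecs.lookup(value)  # reject names the runtime cannot decode
    except LookupError:
        return DEFAULT_CHARSET
    return value


def decode_tag(raw: bytes, plugin: str = "id3",
               environ: Mapping[str, str] = os.environ) -> str:
    """Decode a tag string with the user's override instead of a guess."""
    return raw.decode(charset_for_plugin(plugin, environ), errors="replace")


env = {"EXTRACTOR_ID3_CHARSET": "cp1251"}
print(decode_tag("Привет".encode("cp1251"), "id3", env))
```

A GUI that stores the selection in a configuration file would only need to set this one value per plugin; no language data sets are involved.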

Christian Grothoff

2012-08-24 23:23

manager   ~0006283

Given that ID3 tags are now done by gstreamer, I think the problem doesn't really lie within our domain anymore.

Issue History

Date Modified Username Field Change
2011-12-29 18:34 LRN New Issue
2012-01-03 07:55 Christian Grothoff Note Added: 0005240
2012-01-03 12:07 LRN Note Added: 0005241
2012-01-03 13:25 Christian Grothoff Note Added: 0005242
2012-01-05 22:08 Christian Grothoff Status new => feedback
2012-08-24 23:23 Christian Grothoff Note Added: 0006283
2012-08-24 23:24 Christian Grothoff Status feedback => resolved
2012-08-24 23:24 Christian Grothoff Fixed in Version => Git master
2012-08-24 23:24 Christian Grothoff Resolution open => suspended
2012-08-24 23:24 Christian Grothoff Assigned To => Christian Grothoff
2012-09-09 02:34 Christian Grothoff Fixed in Version Git master => 1.0.0
2012-09-09 02:34 Christian Grothoff Target Version => 1.0.0
2012-09-25 17:18 Christian Grothoff Status resolved => closed