View Issue Details

ID: 0002233
Project: libextractor
Category: libextractor main library
View Status: public
Last Update: 2012-09-25 17:18
Reporter: LRN
Assigned To: Christian Grothoff
Priority: low
Severity: feature
Reproducibility: N/A
Status: closed
Resolution: fixed
Product Version: 0.6.3
Target Version: 1.0.0
Fixed in Version: 1.0.0
Summary: 0002233: Parallel extractor with full file access
Description: This is part of an IRC log where the idea came from:
LRN: Anyway, point is: restricting metadata extraction to head or tail only is fine from performance point of view, but is horrible from conformance point of view.
grothoff: LRN: I have been pondering that design decision as well; the issue is that we won't be able to mmap very large files (at least not in their entirety), so the IPC protocol would need to get MUCH more complicated...
LRN: grothoff, maybe not so much: give head-plugins a head. Wait until they report. Those that managed to get everything out of the head and require nothing more will say so. Others will pass a seek request. Once all seek requests have been collected, do the one that requests the earliest position in the file (that is, change mapping position), then tell all plugins that a seek has been performed.
LRN: Plugin that requested that seek will do processing, and possibly request another seek or report that it's done. Other plugins will re-request the same seeks (if the seek didn't suit them), or some new seeks (in case the seek also gave them the data they needed, and they need different parts of file now), or will report that they are done. Repeat until all plugins are done.
LRN: no seeking backwards
LRN: It will make plugins implementation a bit more complex, since plugins should be able to maintain their state between seeks, and a seek might need to be requested at any point (plugin might make use of a seek that was meant for some other plugin, in which case it might get only a part of data it needs for itself)
grothoff: Sounds good. Still requires (slightly) more complicated IPC, but certainly doable.
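
A minimal sketch (in C) of the master-side arbitration loop described above: collect seek requests from all plugins, perform the smallest forward seek, notify everyone, and repeat until all plugins are done. All types and helpers here (plugin_state, collect_reply, remap_at, notify_all) are hypothetical placeholders for the real IPC and mapping code, not libextractor API.

/* Sketch of the seek-arbitration loop from the IRC log above.
 * All names are hypothetical -- not part of the libextractor API. */
#include <stdint.h>
#include <stddef.h>

enum reply_kind { REPLY_DONE, REPLY_SEEK };

struct plugin_state {
  int done;                 /* plugin reported it needs nothing more */
  uint64_t seek_request;    /* absolute offset the plugin wants next */
};

/* Stand-ins for the real IPC and mmap code. */
extern enum reply_kind collect_reply (struct plugin_state *p);
extern void remap_at (uint64_t offset);
extern void notify_all (struct plugin_state *p, unsigned int n, uint64_t offset);

static void
run_extraction (struct plugin_state *plugins, unsigned int n)
{
  uint64_t position = 0;
  for (;;)
  {
    uint64_t min_seek = UINT64_MAX;
    unsigned int active = 0;
    for (unsigned int i = 0; i < n; i++)
    {
      if (plugins[i].done)
        continue;
      if (REPLY_DONE == collect_reply (&plugins[i]))
      {
        plugins[i].done = 1;
        continue;
      }
      active++;
      /* no seeking backwards: ignore requests behind the current position */
      if ( (plugins[i].seek_request >= position) &&
           (plugins[i].seek_request < min_seek) )
        min_seek = plugins[i].seek_request;
    }
    if (0 == active)
      break;                 /* every plugin is done */
    if (UINT64_MAX == min_seek)
      break;                 /* only backward seeks were requested; bail out */
    position = min_seek;     /* earliest requested offset wins */
    remap_at (position);
    notify_all (plugins, n, position);
  }
}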

Additional Information: How the extractor works right now:
1) If no buffer is provided by the caller, mmap the beginning of the file as the head
2) If the file is large enough and no buffer is provided by the caller, mmap the end of the file as the tail (a minimal POSIX sketch of steps 1-2 follows this list)
3) See whether the head begins with the magic zip or bz2 signature. If so, unpack the data in the head and use the unpacked head from now on. The tail, even if not NULL, is not unpacked
4) See whether it runs OOP and shared memory is required. If so, it starts the child processes, creates a shared memory object, and copies data from the head into that object. If the tail is not NULL, it does the same for the tail (so you have 2 shared memory objects)
5) If it is not running OOP, it calls the extract method of each plugin directly, passing the head and tail to it. Otherwise it does extract_oop() on each plugin, which means writing some data (the names of the shared memory objects) to the plugin and then waiting forever for it to reply (or die trying)
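
For illustration, a minimal POSIX sketch of steps 1-2 (the head/tail mapping). The 32KB sizes and all names here are made up, not the actual libextractor constants.

#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>

#define HEAD_SIZE (32 * 1024)
#define TAIL_SIZE (32 * 1024)

int
main (int argc, char **argv)
{
  if (argc < 2)
    return 1;
  int fd = open (argv[1], O_RDONLY);
  struct stat st;
  if ( (-1 == fd) || (0 != fstat (fd, &st)) )
    return 1;
  long ps = sysconf (_SC_PAGESIZE);
  size_t head_len = (st.st_size < HEAD_SIZE) ? (size_t) st.st_size : HEAD_SIZE;
  void *head = mmap (NULL, head_len, PROT_READ, MAP_PRIVATE, fd, 0);
  void *tail = MAP_FAILED;
  size_t tail_len = 0;
  if (st.st_size > HEAD_SIZE + TAIL_SIZE)
  {
    /* mmap offsets must be page-aligned, so round the tail offset down */
    off_t off = (st.st_size - TAIL_SIZE) & ~((off_t) ps - 1);
    tail_len = (size_t) (st.st_size - off);
    tail = mmap (NULL, tail_len, PROT_READ, MAP_PRIVATE, fd, off);
  }
  printf ("head: %s, tail: %s\n",
          (MAP_FAILED != head) ? "mapped" : "not mapped",
          (MAP_FAILED != tail) ? "mapped" : "not mapped");
  /* ... hand head/tail to the plugins' extract methods here ... */
  if (MAP_FAILED != head)
    munmap (head, head_len);
  if (MAP_FAILED != tail)
    munmap (tail, tail_len);
  close (fd);
  return 0;
}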


What I don't like in this code:
1) The unpacking part only works for complete buffers (as far as my primitive understanding of lossless data compression goes). Seeking in compressed files is problematic. Uncompressing big files takes a lot of time and resources.
Possible fix: see if the file is compressed. If so, uncompress it COMPLETELY in memory (if it's small enough and/or its uncompressed size is small enough), and then work with the uncompressed data in memory. Hmm... I guess this is exactly how it works right now. The only difference is that this should be done before mmap'ing the file (because if it's packed, mmap'ing it makes little sense).
2) OOP (the default mode, by the way) means that the head and tail (if any) are memcpy'ed into shared memory objects, which, again, makes the mmap() of head and tail not very useful. The shared memory objects themselves are accessed via mmaps, though.
Possible fix: don't mmap the file at all?

3) This code is not parallel. It works one plugin at a time; it doesn't do a "give the data to all plugins, then wait for all plugins to process it and finish, or die".
Possible fix: difficult to tell. We don't have access to GNUNET_NETWORK_socket_select(), which means that we can't select on pipes (the extractor interacts with plugins via their standard pipes, AFAIU), which means that it's difficult to wait on plugins without blocking. There's a SELECT() in plibc, but it's unspectacular (roughly as good as GNUnet's NETWORK_socket_select() was a few years ago). Use threads? Are there any hidden thread-related issues, like the one with gcrypt and libc? If that is not an option, I might be able to concoct a select()-only-for-pipes that works fast enough on W32, if we guarantee that all pipes are overlapped (that we can do), although it will definitely duplicate GNUnet util code to some extent.

Seeking. It might be done in two different ways:
1) Overwrite the shared memory object with different chunks read from the source file. Easy to adapt to reading data from a buffer instead of a file; we only need to communicate the new shared memory size to the plugins.
2) Create not a memory-backed mapping (on W32; not sure what backs shared memory on POSIX-compliant systems - it seems to be created in / or /tmp/), but a file-backed mapping, specifying the real file as its source. Plugins will be able to read the file at their leisure by re-mapping different parts of the same shared memory object! This should work well enough with data buffers as well - just make a big mapping of a shared memory object on the server side, dump the buffer into it, and then let the clients map it as they want. (A POSIX sketch of both approaches follows.)
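
A POSIX sketch of the two approaches; the shm object name "/le-scratch" and both function names are made up for illustration. On W32 the rough equivalents would be CreateFileMapping backed by the page file vs. backed by the real file handle.

#define _POSIX_C_SOURCE 200809L
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

/* Approach 1: memory-backed shm object, re-filled chunk by chunk by the parent. */
static void *
map_via_shm (int file_fd, off_t offset, size_t len)
{
  int shm_fd = shm_open ("/le-scratch", O_CREAT | O_RDWR, 0600);
  void *buf;

  if (-1 == shm_fd)
    return MAP_FAILED;
  if (0 != ftruncate (shm_fd, (off_t) len))
  {
    close (shm_fd);
    return MAP_FAILED;
  }
  buf = mmap (NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, shm_fd, 0);
  if (MAP_FAILED != buf)
  {
    /* copy the requested chunk of the real file into the shm object */
    ssize_t got = pread (file_fd, buf, len, offset);
    (void) got;  /* a real implementation would handle short reads */
  }
  close (shm_fd);
  return buf;  /* plugins open and map "/le-scratch"; the parent re-fills it on seek requests */
}

/* Approach 2: file-backed mapping -- plugins simply (re-)map other page-aligned
 * offsets of the real file themselves, no copying involved. */
static void *
map_via_file (int file_fd, off_t offset, size_t len)
{
  return mmap (NULL, len, PROT_READ, MAP_SHARED, file_fd, offset);
}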
Tags: No tags attached.

Activities

LRN

2012-03-30 18:21

reporter   ~0005665

After the recent architecture update, LE now works like this:
1) reads the first 4 bytes of the file (or memory buffer)
2) if it's compressed and small, uncompresses it into another memory buffer and continues to use that buffer; otherwise bails out
3) gives the uncompressed buffer OR the memory buffer OR the file id + 4 bytes to the extractor
4) the extractor creates a memory-backed shared memory object (shm), starts all processes, initializes them (on W32; on POSIX plugin initialization is a noop), then initializes state for each plugin (calls the init_state method), then fills the shm with data from the memory buffer OR from the file, tells every plugin the number of bytes in the shm, and waits for plugin replies
5) every plugin opens the shm, then waits. When it gets the number of bytes, it maps that many bytes from the shm and processes them, calling back with metadata (sending it over a pipe) as it goes. When it runs out of data, it decides where to seek, tells the extractor, and waits for the next shm update (a plugin-side sketch of this loop follows the list)
6) the extractor processes replies and seek requests. If it was a memory buffer, the extractor finalizes; otherwise it calculates the next seek offset (the minimum of all seek requests), seeks the file, reads another chunk into the shm, announces the new shm size to the plugins, and waits for replies
7) the plugins get the size, map the shm, then process the data from the point where they ran out of bytes
8) when the extractor is done, it tells the plugins to discard their state, then returns
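
A rough plugin-side sketch of the loop in steps 5-7. recv_shm_size(), map_shm(), send_seek_request() and parse_some() are hypothetical stand-ins for the pipe IPC and for the plugin's own parser; they are not actual libextractor functions.

#include <stdint.h>
#include <stddef.h>

struct parser_state;                       /* the plugin's out-of-stack state */

extern int64_t recv_shm_size (void);       /* blocks on the pipe; < 0 means "discard state" */
extern const void *map_shm (size_t size);  /* maps 'size' bytes of the shared shm object */
extern void send_seek_request (uint64_t next_offset);  /* pipe message to the extractor */
/* Parses as much as it can, reporting metadata over the pipe as it goes,
 * and returns the absolute file offset it needs next. */
extern uint64_t parse_some (struct parser_state *s, const void *data, size_t size);

static void
plugin_loop (struct parser_state *state)
{
  for (;;)
  {
    int64_t size = recv_shm_size ();       /* steps 5/7: wait for the next shm update */
    if (size < 0)
      break;                               /* step 8: extractor is done; discard state */
    const void *data = map_shm ((size_t) size);
    uint64_t next = parse_some (state, data, (size_t) size);
    send_seek_request (next);              /* ran out of bytes: tell LE where to seek next */
  }
}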

This approach forces every plugin to be a state machine to some degree: it might run out of bytes at any moment, and it must return and wait for the next update before continuing; for that it needs to preserve its state outside of the stack and be able to work correctly from any valid state.
That is somewhat inconvenient (although it does make sense for certain plugins, such as the one for EBML files, where file contents are highly variable).

Also, it means that the initial memory buffer (if it's a memory buffer) is copied into the shm, practically wasting memory (the buffer contents aren't modified anyway). It also means that the extractor does mundane sequential 32MB reads from the file (if it's not a memory buffer) into the shm. I'm not sure whether that is optimal.

Also, plugins can't request to seek backwards.

On the bright side: all plugins work with the same shm, and map the same part of it (that is, the whole shm, except for the last chunk), which might be memory-efficient (depends on the OS, I guess).

LRN

2012-03-30 18:44

reporter   ~0005666

Further development:
mmap tests on Debian GNU/Linux (x86_64, Linux-3.2.x) and NT 6.1 have shown that the system is able to provide an adequate level of performance when mapping the real file directly (instead of copying it chunk-by-chunk into shm and mapping that), and performance doesn't drop much when plugins map different parts of the file in random sequences, reading bytes from random locations of the mapped region.
However, performance does drop sharply when the mapped blocks are small (32KB instead of 32MB), even if the number of random reads is reduced accordingly (1 instead of 1024). That is, it's still faster to map large chunks and seek within them than to map lots of small chunks.
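
A rough sketch of that measurement (timing code omitted); the chunk sizes and read counts mirror the figures above, everything else is made up.

#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

static unsigned char sink;

/* Map 'chunk'-sized windows over the whole file, doing 'reads_per_chunk'
 * random reads inside each window before moving on. */
static void
scan (int fd, off_t file_size, size_t chunk, unsigned int reads_per_chunk)
{
  for (off_t off = 0; off < file_size; off += chunk)
  {
    size_t len = (size_t) ((file_size - off < (off_t) chunk)
                           ? (file_size - off) : (off_t) chunk);
    unsigned char *p = mmap (NULL, len, PROT_READ, MAP_PRIVATE, fd, off);
    if (MAP_FAILED == p)
      return;
    for (unsigned int i = 0; i < reads_per_chunk; i++)
      sink ^= p[(size_t) rand () % len];   /* touch a random byte in the window */
    munmap (p, len);
  }
}

int
main (int argc, char **argv)
{
  if (argc < 2)
    return 1;
  int fd = open (argv[1], O_RDONLY);
  struct stat st;
  if ( (-1 == fd) || (0 != fstat (fd, &st)) )
    return 1;
  scan (fd, st.st_size, 32 * 1024 * 1024, 1024);  /* large windows, many reads each */
  scan (fd, st.st_size, 32 * 1024, 1);            /* small windows, one read each */
  close (fd);
  return 0;
}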

Because of that i have decided to replace current implementation with the following:
1) LE reads the first 4 bytes of the file (or memory buffer)
2) if it's compressed and small, uncompresses it into another memory buffer and continues to use that buffer; otherwise bails out
3) gives the uncompressed buffer OR the memory buffer OR the file id to the extractor
4) the extractor starts all processes, initializes them (on W32; on POSIX plugin initialization is a noop), then initializes state for each plugin (calls the init_state method), then waits for plugin replies.
5) Every plugin gets the file id and size from LE (on POSIX it gets the file id via fork(); on W32 we need to duplicate the file handle, then pass its value to the child the same way this is done at the moment - via pipe), maps the first chunk of the file, and runs its extract method, where it keeps parsing the file and reporting back metadata until it runs out of bytes in the mapped region. Then it re-maps a different chunk (all within its read() wrapper) and continues reading, until it reaches the end of the file. Then it tells LE that it's done, unmaps, and closes its copy of the file id.
6) LE collects replies, then returns once all plugins report that they are finished

If LE has to work with a memory buffer, it creates a memory-backed shm, just as before, and passes its id instead of the file id (plugins won't know the difference - both on POSIX and W32).

Plugins won't need to maintain out-of-stack state, which simplifies the implementation back to the pre-read-through-architecture level (utility code will still need to do some wrapping for read() calls, but that's easy compared to state machines). They also won't need the init_state() and discard_state() methods. That said, the EBML plugin will probably remain highly state-based in its logic.
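
A sketch of what such a read() wrapper could look like for the file-backed case, assuming a 32MB window; all names and the struct layout here are hypothetical, not the final utility API.

#include <sys/mman.h>
#include <sys/types.h>
#include <string.h>
#include <stdint.h>

#define WINDOW_SIZE (32 * 1024 * 1024)

struct pl_file
{
  int fd;               /* duplicated file descriptor/handle passed by LE */
  uint64_t file_size;
  uint64_t window_off;  /* absolute offset of the current mapping */
  size_t window_len;
  unsigned char *window;
  uint64_t pos;         /* absolute read position */
};

static int
pl_remap (struct pl_file *f, uint64_t offset)
{
  if (NULL != f->window)
    munmap (f->window, f->window_len);
  f->window_off = offset & ~((uint64_t) WINDOW_SIZE - 1); /* align to window size */
  f->window_len = (size_t) ((f->file_size - f->window_off < WINDOW_SIZE)
                            ? (f->file_size - f->window_off) : WINDOW_SIZE);
  f->window = mmap (NULL, f->window_len, PROT_READ, MAP_PRIVATE,
                    f->fd, (off_t) f->window_off);
  if (MAP_FAILED == f->window)
  {
    f->window = NULL;
    return -1;
  }
  return 0;
}

/* Read up to 'len' bytes at the current position, remapping as needed. */
static ssize_t
pl_read (struct pl_file *f, void *buf, size_t len)
{
  size_t copied = 0;
  while ( (copied < len) && (f->pos < f->file_size) )
  {
    if ( (NULL == f->window) ||
         (f->pos < f->window_off) ||
         (f->pos >= f->window_off + f->window_len) )
    {
      if (0 != pl_remap (f, f->pos))
        return -1;
    }
    size_t avail = (size_t) (f->window_off + f->window_len - f->pos);
    size_t take = (len - copied < avail) ? (len - copied) : avail;
    memcpy ((char *) buf + copied, f->window + (f->pos - f->window_off), take);
    f->pos += take;
    copied += take;
  }
  return (ssize_t) copied;
}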

Christian Grothoff

2012-03-30 19:55

manager   ~0005667

I'm thinking we're still doing it wrong. My current feeling is that we should provide a blocking file-IO read-only API to the plugins so that we do NOT have to convert them to complex state machines. So the plugins would call functions like

"LE_PLUGIN_seek", "LE_PLUGIN_read", and maybe "LE_PLUGIN_get_size" (open/close would not be needed)

on a 'file handle' that they're given during the initial plugin call. The functions then do the IPC with the master process in a blocking way, so 'read' can either instantly succeed if the desired area happens to be in the SHM range, or block until the master process has updated the SHM segment and notified the plugin (via pipe IPC) that the position is now set to the desired location.

That will make it *trivial* to interact with libraries that expect a file-IO API, and also make it easier to migrate the existing plugins *without* changing them to a state machine.
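
To illustrate the proposal, a sketch of what that plugin-facing API surface might look like. The three function names come from this note; the handle type, the exact signatures, and the example plugin body are guesses, not the actual implementation.

#include <stdint.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/types.h>

struct LE_PLUGIN_File;   /* opaque handle given to the plugin at the initial call */

/* Blocking: may trigger IPC with the master process to move the SHM window. */
int64_t LE_PLUGIN_seek (struct LE_PLUGIN_File *f, int64_t offset, int whence);
/* Blocking: yields a pointer to 'size' bytes at the current position -- instantly
 * if they are already in the SHM range, otherwise after the master has updated
 * the segment and signalled us over the pipe. */
ssize_t LE_PLUGIN_read (struct LE_PLUGIN_File *f, void **data, size_t size);
uint64_t LE_PLUGIN_get_size (struct LE_PLUGIN_File *f);

/* What a plugin would look like against such an API -- no state machine,
 * just straight-line parsing code: */
static void
example_extract (struct LE_PLUGIN_File *f)
{
  void *hdr;
  if (12 > LE_PLUGIN_read (f, &hdr, 12))
    return;                                 /* file too short */
  /* ... inspect the header, then jump straight to the footer ... */
  LE_PLUGIN_seek (f, (int64_t) LE_PLUGIN_get_size (f) - 128, SEEK_SET);
  /* ... read and report tag data ... */
}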

Christian Grothoff

2012-03-30 19:57

manager   ~0005668

Also note: I'm sure 'compress' (& gzip) can do decompression in a streaming fashion, that is, given only the first N KB of a file, already provide the first M decompressed KB of that file. So we can support compression for very large files as well; only seeking backwards would be VERY expensive (as we'd need to decompress from the beginning).
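
For illustration, a sketch of such streaming decompression using zlib's inflate() (covering gzip/zlib streams; the buffer sizes are arbitrary).

#include <zlib.h>
#include <stdio.h>
#include <string.h>

int
main (int argc, char **argv)
{
  if (argc < 2)
    return 1;
  FILE *in = fopen (argv[1], "rb");
  if (NULL == in)
    return 1;
  unsigned char inbuf[16 * 1024];
  unsigned char outbuf[64 * 1024];
  z_stream zs;
  memset (&zs, 0, sizeof (zs));
  /* 15 + 32: let zlib auto-detect gzip or raw zlib headers */
  if (Z_OK != inflateInit2 (&zs, 15 + 32))
    return 1;
  int ret = Z_OK;
  unsigned long total_out = 0;
  while (Z_STREAM_END != ret)
  {
    zs.avail_in = (uInt) fread (inbuf, 1, sizeof (inbuf), in);
    if (0 == zs.avail_in)
      break;                           /* ran out of compressed input */
    zs.next_in = inbuf;
    do
    {
      zs.next_out = outbuf;
      zs.avail_out = sizeof (outbuf);
      ret = inflate (&zs, Z_NO_FLUSH);
      if ( (Z_OK != ret) && (Z_STREAM_END != ret) )
        goto done;                     /* corrupt stream */
      /* here the decompressed bytes in outbuf would be handed to the plugins */
      total_out += sizeof (outbuf) - zs.avail_out;
    }
    while (0 == zs.avail_out);
  }
done:
  printf ("decompressed %lu bytes so far\n", total_out);
  inflateEnd (&zs);
  fclose (in);
  return 0;
}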

LRN

2012-03-30 21:15

reporter   ~0005669

> My current feeling is that we should provide a blocking file-IO read-only API to the plugins so that we do NOT have to convert them to complex state machines. So the plugins would call functions like
>
> "LE_PLUGIN_seek", "LE_PLUGIN_read", and maybe "LE_PLUGIN_get_size" (open/close would not be needed)
>
> on a 'file handle' that they're given during the initial plugin call.

Here, let me quote myself:
> Every plugin gets the file id and size from LE ..., maps the first chunk of the file, and runs its extract method, where it keeps parsing the file and reporting back metadata until it runs out of bytes in the mapped region. Then it re-maps a different chunk (all within its read() wrapper) and continues reading, until it reaches the end of the file. Then it tells LE that it's done, unmaps, and closes its copy of the file id.
> ...
> Plugins won't need to maintain out-of-stack state, which simplifies the implementation back to the pre-read-through-architecture level (utility code will still need to do some wrapping for read() calls, but that's easy compared to state machines). They also won't need the init_state() and discard_state() methods.

This will not only eliminate the state-saving requirement (as I've said already), but will also remove the need for LE to manage seeking for plugins (plugins will do the re-maps themselves).

Now, your latter comment about streaming decompression rings true, and it certainly complicates matters to the point where we do need plugins to tell LE that they need to seek to a particular position.
With some thought this can be done transparently for plugins - that is, plugins will have to refrain from seeking backwards, while the API will take care of EITHER signalling LE that it needs to seek (then waiting for the answer and arranging for the plugin to read from the right place) OR re-mapping a different region of a file (in case we're working with a real file).
That is, chunk-sized-shm-refilled-by-LE-parent-on-demand and file-backed-mmap-managed-by-LE-child are two different modes of operation ("LE-parent" and "LE-child" here underline the fact that LE code will handle the details; plugin will just use read() and seek() wrappers).

LRN

2012-04-04 13:02

reporter   ~0005695

OK, this is, apparently, a bit more difficult. The problem came from an unexpected source - in-process plugins.
If the state machinery is to be abandoned, LE has to implement read() and seek() wrappers for plugins to use, and the extract method will be called only once for each plugin. The wrappers will check whether the plugin ran out of buffer and will either do the re-mapping or ask the LE server to update the shm (in the case of reading a compressed file/memory) - blocking within the call.
The problem arises when we combine on-the-fly unpacking with in-process plugins.
On-the-fly unpacking means reading chunks from the file and feeding them to the unpacker until we run out of data or the output buffer is full, at which point we give the buffer to the extract method. After the extract method processes the buffer, it asks for the next one, and we go back to feeding the unpacker. That is, a plugin cannot block in a wrapper function to unpack more data; if we did that, we would have to re-unpack the data multiple times (once for each plugin). This is not a problem for OOP plugins, because they work in parallel, but in-process plugins are invoked sequentially.

Possible ways to fix this:
1) Don't uncompress on the fly (i.e. uncompress small files into memory and proceed reading from memory; that's how SVN HEAD does it right now; it certainly simplifies the implementation).
2) Don't use in-process plugins when uncompressing on the fly.
3) Keep using the state-preserving approach (in that case I'll only have to implement on-the-fly uncompression and leave everything in SVN HEAD as it is).

Christian Grothoff

2012-04-04 13:11

manager   ~0005696

Good point. I'm thinking maybe we should just ditch in-process processing entirely *except* for debugging of a single plugin (so if we are running with only 1 plugin, the problem you mentioned goes away, and having the ability to run in-process in that case would be enough to facilitate debugging bugs in the plugins).

Christian Grothoff

2012-08-23 19:58

manager   ~0006281

This now seems to work.

Issue History

Date Modified Username Field Change
2012-03-20 17:08 LRN New Issue
2012-03-30 18:21 LRN Note Added: 0005665
2012-03-30 18:44 LRN Note Added: 0005666
2012-03-30 19:55 Christian Grothoff Note Added: 0005667
2012-03-30 19:57 Christian Grothoff Note Added: 0005668
2012-03-30 21:15 LRN Note Added: 0005669
2012-04-04 13:02 LRN Note Added: 0005695
2012-04-04 13:11 Christian Grothoff Note Added: 0005696
2012-08-23 19:58 Christian Grothoff Note Added: 0006281
2012-08-23 19:58 Christian Grothoff Status new => resolved
2012-08-23 19:58 Christian Grothoff Fixed in Version => Git master
2012-08-23 19:58 Christian Grothoff Resolution open => fixed
2012-08-23 19:58 Christian Grothoff Assigned To => Christian Grothoff
2012-09-09 02:35 Christian Grothoff Product Version => 0.6.3
2012-09-09 02:35 Christian Grothoff Fixed in Version Git master => 1.0.0
2012-09-09 02:35 Christian Grothoff Target Version => 1.0.0
2012-09-25 17:18 Christian Grothoff Status resolved => closed