View Issue Details
ID | Project | Category | View Status | Date Submitted | Last Update |
---|---|---|---|---|---|
0009053 | Taler | exchange | public | 2024-08-09 20:29 | 2025-06-05 12:14 |
Reporter | Christian Grothoff | Assigned To | Christian Grothoff | ||
Priority | high | Severity | feature | Reproducibility | N/A |
Status | assigned | Resolution | open | ||
Platform | i7 | OS | Debian GNU/Linux | OS Version | squeeze |
Product Version | git (master) | ||||
Target Version | 1.0 stretch goals | ||||
Summary | 0009053: add support for automated sanction list processing [2d] / 8.6.25 | ||||
Description | We first need to get our hands on an actual sanction list so we know what the format looks like. | ||||
Tags | compliance | ||||
parent of | 0010072 | confirmed | some inputs fail to yield score, depending on sanction list and input | |
Not all the children of this issue are yet resolved or closed. |
|
Todo:
- actual helper(s) to evaluate sanctions list against attributes
- testing!
- improve threshold formulas |
|
Vint has delivered: https://git.disroot.org/lnrs/kycheck |
|
I've compiled the code and found several issues: 1) Downloaded the consolidated-list_2025-05-15.xml from https://www.sesam.search.admin.ch/sesam-search-web/pages/downloadXmlGesamtliste.xhtml?lang=en&action=downloadXmlGesamtlisteAction, then imported the 37 MB file via ~/.local/bin/kycheck --input ~/Downloads/consolidated-list_2025-05-15.xml Result: kycheck takes 60s compute time on a Threadripper 1950X and consumes 1000 GB of virtual and 6 GB of actual RAM on-load. WTF? Note that xmllint parses the same XML in less than 1s on this system. There is also no conceivable reason to use significantly more RAM than the size of the list, so 64 MB would be fine, but 6 GB is out-of-this-world! Note that we intend to deploy GNU Taler on systems with less memory than this; you're using more than everything else combined! Plus it is awfully slow. 2) Next I tried to use it, and pasted '{"company_name" : "Eindhoven University of Technology", "id" : "abcdef", "address" : { "country" : "NL", "street_name" : "Groene Loper", "street_number" : "3", "zipcode" : "5612 AE", "town_location" : "Eindhoven" } }' from your example input into STDIN. The result was: '"Could not decode JSON (\"Error in $: not enough input\"), please try again"'. Note the malformed error message, and the fact that I'm not getting what was promised either... I also tried just giving '{}' as the input (empty JSON object), same error. So it's definitely not a syntax error in the input. 3) I made one tiny modification to your build system, setting 'enable: false' in stack.yaml. After that, I could kind-of build on Debian stable just using 'stack build' (without NixOS installed). It still insisted on re-installing (!) the same version (!) of ghc, so I'm still not happy with the build system as the current state is not reasonable for creating Debian packages.
Update: Figured out how to build a nice Debian package in 4c475a4..ec1fdaa -- did not need any significant changes to the build system, just command-line arguments to override. Nice! |
|
4) The project has some insane dependencies. Like crypton, building Twofish and other cryptographic primitives. An HTTP client library. CBOR. ASN1. Socks5 support. iproute. blaze-html, zlib. In the end, you have a 53 MB binary (plus external C code) for 3000 lines of code! This is just wrong on many levels, impossible supply chain. I understand type-safety is easily confused with "safe dependency", but this is too much by far for what the project needs. 5) Adding "system-ghc: true" to stack.yaml seems to convince it to use the local compiler. ;-). Update: 4c475a4..ec1fdaa uses just command-line arguments to override to make the build work nicely on Debian. |
|
6) Compiler warning: kycheck/app/Main.hs:76:85: warning: [GHC-18042] [-Wtype-defaults] • Defaulting the type variable ‘a0’ to type ‘Integer’ in the following constraints (Show a0) arising from a use of ‘show’ at app/Main.hs:76:85-88 (Integral a0) arising from a use of ‘floor’ at app/Main.hs:76:92-96 • In the first argument of ‘($)’, namely ‘show’ In the second argument of ‘(++)’, namely ‘(show $ floor $ diffUTCTime start (UTCTime age 0))’ In the second argument of ‘($)’, namely ‘"Seconds since epoch: " ++ (show $ floor $ diffUTCTime start (UTCTime age 0))’ | 76 | Just age -> print $ "Seconds since epoch: " ++ (show $ floor $ diffUTCTime start (UTCTime age 0)) | |
|
Notes on how to build the Debian package: # apt install -t testing cdbs haskell-stack debhelper dhall haskell-devscripts-minimal ghc $ dpkg-buildpackage -rfakeroot -b -uc -us (with my debian/ folder). This worked on my office system, now locally I get "Failed to find C++ standard library", so probably some dependency is still missing in the list... |
|
Installing g++-14 worked, not sure why g++-12 wasn't enough... Update: Made g++-14 a dependency for the Debian build. |
|
(7) When giving --silent, it still prints '"Seconds since epoch: 26647950"'. Not sure why. This also goes to stdout, which is quite bad as it'd break the parser that expects to receive results in JSON. We should make sure to log at best to stderr and make sure stdout is strictly limited to the JSON result. Update: logging changed to stderr, not sure what the output is supposed to tell us still. But harmless. (8) It expects "quit" to be entered to, well, quit. Which I guess is OK, but it should also just quit on CTRL-D (end of stream). Which it does, except with an error message (even on 'silent'): 'kycheck: <stdin>: hGetLine: end of file'. I'm not sure we need the "quit" feature, and I think it would be nicer (given that the main interaction with this tool will not be by humans) if these outputs were removed (or at least left on only in --debug mode or so). Update: now implemented to exit cleanly on CTRL-D (and always on invalid input). |
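The behaviour asked for in (7) and (8) can be sketched as a small read loop. This is only an illustration in C, not the Haskell code of the tool: `process_requests`, `LineHandler` and `echo_handler` are hypothetical names, and the handler is a placeholder for the real matching logic. The loop writes results only to the output stream, sends diagnostics to the error stream, and treats end of stream (CTRL-D) as a normal exit.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical callback: handles one request line and writes exactly
   one result line to 'out'. Returns 0 on success. */
typedef int (*LineHandler) (const char *line, FILE *out);

/* Read requests line by line until end of stream.
   Results go to 'out' only, diagnostics go to 'err' only.
   Returns 0 on clean end of stream, 1 on invalid input or read error.
   Note: lines longer than the buffer would be split in this sketch. */
int
process_requests (FILE *in, FILE *out, FILE *err, LineHandler handle)
{
  char line[4096];

  while (NULL != fgets (line, sizeof (line), in))
  {
    line[strcspn (line, "\n")] = '\0'; /* strip trailing newline */
    if (0 != handle (line, out))
    {
      fprintf (err, "invalid input, exiting\n");
      return 1;
    }
    fflush (out); /* the parent process waits for each result */
  }
  /* fgets() returned NULL: end of stream (e.g. CTRL-D) or read error */
  if (ferror (in))
  {
    fprintf (err, "read error\n");
    return 1;
  }
  return 0; /* end of stream is not an error */
}

/* Placeholder handler for illustration: rejects empty lines and
   echoes everything else. */
int
echo_handler (const char *line, FILE *out)
{
  if ('\0' == line[0])
    return -1;
  fprintf (out, "%s\n", line);
  return 0;
}
```

A caller would typically run `process_requests (stdin, stdout, stderr, &echo_handler)` and use the return value as the exit status, so no "quit" command is needed.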
|
(9) "Could not decode JSON (\"Unexpected end-of-input, expecting JSON value\"), please try again" is also logged to stdout instead of stderr. Here I just submitted "new line + CTRL-D". I'm OK with the code insisting on each line being a JSON input, but we should figure out how to do error handling nicely. The application will expect a JSON result per JSON input on stdout, so probably a good way to do it would be to log a human-readable error to stderr, and define some JSON format for the error on stdout, like: '{"status":"error", "code":42, "hint":"..."}'. We could then extend the normal output format to something like '{"status": "success", "match_quality": 0.85357136, "confidence": 0.9311688, "expiration": 0, "reference": 73508}'. Update: logging fixed to go to stderr, plus CTRL-D is now implemented, plus output is no longer JSON-ish but in the sscanf() format expected by taler-exchange-sanctionscheck. |
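For reference, the error object proposed in (9) could be produced roughly as follows. This is a minimal sketch in C under stated assumptions: the field names "status", "code" and "hint" come from the note above, while `format_error_json` and `append_escaped` are hypothetical helpers, and the hint is assumed to be UTF-8. The escaping step is what avoids the malformed nested-quote message seen earlier.

```c
#include <stdio.h>
#include <string.h>

/* Append 'src' to 'dst' with JSON string escaping.
   'cap' is the capacity of 'dst', '*len' its current length.
   Returns 0 on success, -1 if 'dst' is too small. */
static int
append_escaped (char *dst, size_t cap, size_t *len, const char *src)
{
  for (const unsigned char *p = (const unsigned char *) src; *p; p++)
  {
    char tmp[7]; /* longest escape is \u00XX plus NUL */

    if ( ('"' == *p) || ('\\' == *p) )
      snprintf (tmp, sizeof (tmp), "\\%c", *p);
    else if (*p < 0x20)
      snprintf (tmp, sizeof (tmp), "\\u%04x", (unsigned int) *p);
    else
      snprintf (tmp, sizeof (tmp), "%c", *p);
    size_t n = strlen (tmp);
    if (*len + n >= cap)
      return -1;
    memcpy (dst + *len, tmp, n + 1);
    *len += n;
  }
  return 0;
}

/* Write {"status":"error","code":CODE,"hint":"HINT"} into 'out'.
   Returns 0 on success, -1 if 'out' is too small. */
int
format_error_json (char *out, size_t cap, int code, const char *hint)
{
  int n = snprintf (out, cap,
                    "{\"status\":\"error\",\"code\":%d,\"hint\":\"",
                    code);

  if ( (n < 0) || ((size_t) n >= cap) )
    return -1;
  size_t len = (size_t) n;
  if (0 != append_escaped (out, cap, &len, hint))
    return -1;
  if (len + 2 >= cap)
    return -1;
  memcpy (out + len, "\"}", 3); /* closing quote, brace and NUL */
  return 0;
}
```

The caller would print the resulting line to stdout and the plain hint to stderr, so that a regular JSON parser on the other end sees exactly one object per input line.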
|
(10) I now got something close to the expected output for a 'normal' person: >> $ /tmp/bin/kycheck --silent --input files/consolidated-list_2024-07-30.xml 2> /dev/null "Seconds since epoch: 26648230" {"full_name" : "Maria Consuela", "last_name" : "", "address" : { "country" : "GT", "street_name" : "Unknown", "street_number" : "", "zipcode" : "" }, "birthdate" : "1953-06-23", "nationality" : "GT", "national_id" : "" } Score {match_quality = 0.85357136, confidence = 0.9311688, expiration = 0, reference = 73508} << What is not great is the "Score" prefix, which just makes parsing harder. Furthermore, the syntax is only *almost* JSON; we should use '{"status":"success","match_quality": 0.85357136, "confidence": 0.9311688, "expiration": 0, "reference": 73508}' so that the main process can run this through a regular JSON parser and not something custom. UPDATE: I've checked, and the C code actually uses sscanf (buf, "%lf %lf %llu %1023s", &rating, &confidence, &expire, best_match) for parsing the robocop output. Git ec1fdaa..53ec32c modified the robocop output to match that requirement. (11) I also wonder what the "expiration" of 0 means here. "never"? Already expired? What's the unit? Update: The unit is seconds; I changed the C code to interpret 0 as "forever", which I guess is the conservative solution. We should still figure out if 0 expiration is a bug or simply that the sanction list fails to say. |
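To make the resulting line format concrete, here is a minimal sketch of a parser built around the sscanf() call quoted above. Only the format string and the four variable names come from the note; `struct RobocopResult`, `parse_robocop_line` and `robocop_expires_never` are hypothetical and not taken from taler-exchange-sanctionscheck. The mapping of the four fields to match_quality, confidence, expiration and reference is assumed from their order.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical container for one robocop result line. */
struct RobocopResult
{
  double rating;             /* assumed to carry match_quality */
  double confidence;
  unsigned long long expire; /* seconds; 0 is read as "forever" */
  char best_match[1024];     /* %1023s leaves room for the NUL */
};

/* Parse one line of the form "<rating> <confidence> <expire> <ref>".
   Returns 0 on success, -1 if not all four fields could be read. */
int
parse_robocop_line (const char *buf, struct RobocopResult *res)
{
  assert (NULL != buf);
  assert (NULL != res);
  memset (res, 0, sizeof (*res));
  if (4 != sscanf (buf,
                   "%lf %lf %llu %1023s",
                   &res->rating,
                   &res->confidence,
                   &res->expire,
                   res->best_match))
    return -1;
  return 0;
}

/* Nonzero if the match never expires (expiration of 0). */
int
robocop_expires_never (const struct RobocopResult *res)
{
  return 0 == res->expire;
}
```

With this format, the old `Score {match_quality = ...}` output is rejected because the first conversion fails, which is why the robocop output had to change to match.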
|
(2) Is now better for normal users (see above), but I still cannot get the business match to work, I tried: $ /tmp/bin/kycheck --silent --input files/consolidated-list_2024-07-30.xml 2> /dev/null {"company_name" : "Eindhoven University of Technology", "id" : "abcdef", "address" : { "country" : "NL", "street_name" : "Groene Loper", "street_number" : "3", "zipcode" : "5612 AE", "town_location" : "Eindhoven" } } "Seconds since epoch: 26648725" and there is simply no output at all (no error, nothing). The result is the same even with stderr / without --silent. |
|
(1) was explained by Michiel to be the pre-processing / building FSMs for fast matching later, which makes sense (especially now that I can see the runtime even with the pre-processing). I thought the code was just parsing the XML, so that explains the memory consumption and the loading time. On the TOPS production system, I checked and we have more than enough RAM, and GLS hasn't yet indicated that they will want us to do sanction enforcement, so for now the RAM usage is no real concern. I'm still uncomfortable with the dependency chain (but the current binary is MUCH smaller already, 22 MB, massive reduction!). But I think this can easily be fixed by: (a) splitting the cool (and in principle generic!) matching logic from the (complex) XML parsing: convert Swiss XML to some "internal" JSON format first -- which can be done away from the production system and we will care much less about the dependency chain here -- and then (b) load our internal JSON into the production matching logic which ONLY does the approximate matching and then only needs JSON inputs (stdin, --input) and produces JSON outputs and thus doesn't *need* an XML parser as a dependency anymore. This would also have the advantage that if we in the future get sanction lists in other countries in other formats (CSV, JSON, different XML, who knows!) we can just write a converter to the "internal" JSON format (which should be documented...) and can keep the interesting core logic unchanged. |
|
(12) I think we should (re)consider the binary name. "kycheck" is a bit short and IMO not memorable enough. Given that it is about automated enforcement of sanctions, "robocop" comes to mind (or, if you prefer, "murphy", which via Murphy's law also gives the idea that things could go wrong / that the matching is statistical). robocop is not yet taken, which is also good; there is a libmurphy in Debian, which is a minor concern for that name. If/when we do the split, we also need a 2nd name for the XML-to-JSON converter, but there I'd go for something purely functional, such as "robocop-helper-ch-xml-converter". WDYT? Update: in absence of a response, I've changed everything to "robocop". |
|
Starting point for Debian package (to be renamed) attached. We should also move the git to git.taler.net, will need Vint's ssh public key for that. Update: Found Vint's key, robocop.git created. |
|
Worked on taler-exchange-sanctionscheck today, added:
- options for configurable thresholds
- incremental processing, storing last seen row ID
- background mode waiting for notify
- reset command-line option to work on new sanction list
Todo:
- DB notify is missing on kyc attribute insertion
- testing of taler-exchange-sanctionscheck (manual, automated, etc.)
- integration test with Haskell tool
- ansible integration |
|
There seems to be only one issue (0010072) left before Robocop is usable. Main Todos here:
- integration test
- ansible integration |
Date Modified | Username | Field | Change |
---|---|---|---|
2024-08-09 20:29 | Christian Grothoff | New Issue | |
2024-08-09 20:29 | Christian Grothoff | Status | new => assigned |
2024-08-09 20:29 | Christian Grothoff | Assigned To | => Christian Grothoff |
2024-08-19 09:01 | Christian Grothoff | Target Version | 0.14 => 1.0 |
2024-08-23 00:24 | Christian Grothoff | Target Version | 1.0 => 1.0 stretch goals |
2024-08-24 10:40 | Christian Grothoff | Summary | add support for sanction lists => add support for sanction lists [5d] |
2024-09-14 00:57 | Christian Grothoff | Priority | urgent => high |
2025-01-05 16:24 | Christian Grothoff | Note Added: 0023938 | |
2025-01-05 16:24 | Christian Grothoff | Summary | add support for sanction lists [5d] => add support for sanction lists [4d] |
2025-01-05 23:22 | Christian Grothoff | Note Edited: 0023938 | |
2025-01-12 09:17 | Christian Grothoff | Note Edited: 0023938 | |
2025-04-17 22:21 | Christian Grothoff | Tag Attached: compliance | |
2025-05-07 16:51 | Florian Dold | Summary | add support for sanction lists [4d] => add support for automated sanction list processing [4d] |
2025-05-09 09:23 | Christian Grothoff | Note Added: 0024860 | |
2025-05-29 15:56 | Christian Grothoff | Note Added: 0025052 | |
2025-05-29 16:15 | Christian Grothoff | Note Added: 0025053 | |
2025-05-29 16:16 | Christian Grothoff | Note Added: 0025054 | |
2025-06-03 01:31 | Christian Grothoff | Summary | add support for automated sanction list processing [4d] => add support for automated sanction list processing [4d] / 8.6.25 |
2025-06-03 11:49 | Christian Grothoff | Note Added: 0025094 | |
2025-06-03 12:04 | Christian Grothoff | Note Added: 0025095 | |
2025-06-03 12:16 | Christian Grothoff | Note Added: 0025097 | |
2025-06-03 12:19 | Christian Grothoff | Note Added: 0025098 | |
2025-06-03 12:25 | Christian Grothoff | Note Added: 0025099 | |
2025-06-03 12:27 | Christian Grothoff | Note Added: 0025100 | |
2025-06-03 12:27 | Christian Grothoff | Note Edited: 0025100 | |
2025-06-03 12:35 | Christian Grothoff | Note Added: 0025101 | |
2025-06-03 12:48 | Christian Grothoff | Note Added: 0025102 | |
2025-06-03 12:52 | Christian Grothoff | Note Added: 0025103 | |
2025-06-03 12:52 | Christian Grothoff | File Added: deb.tar | |
2025-06-03 16:20 | Christian Grothoff | Summary | add support for automated sanction list processing [4d] / 8.6.25 => add support for automated sanction list processing [3d] / 8.6.25 |
2025-06-03 16:22 | Christian Grothoff | Note Added: 0025107 | |
2025-06-03 16:23 | Christian Grothoff | Note Edited: 0025107 | |
2025-06-05 11:07 | Christian Grothoff | Note Edited: 0025052 | |
2025-06-05 11:08 | Christian Grothoff | Note Edited: 0025053 | |
2025-06-05 11:08 | Christian Grothoff | Note Edited: 0025095 | |
2025-06-05 11:09 | Christian Grothoff | Note Edited: 0025102 | |
2025-06-05 11:09 | Christian Grothoff | Note Edited: 0025103 | |
2025-06-05 11:50 | Christian Grothoff | Note Edited: 0025099 | |
2025-06-05 11:51 | Christian Grothoff | Note Edited: 0025098 | |
2025-06-05 11:52 | Christian Grothoff | Note Edited: 0025097 | |
2025-06-05 12:03 | Christian Grothoff | Relationship added | parent of 0010072 |
2025-06-05 12:14 | Christian Grothoff | Note Added: 0025146 | |
2025-06-05 12:14 | Christian Grothoff | Summary | add support for automated sanction list processing [3d] / 8.6.25 => add support for automated sanction list processing [2d] / 8.6.25 |