View Issue Details

ID: 0009053
Project: Taler
Category: exchange
View Status: public
Last Update: 2025-06-05 12:14
Reporter: Christian Grothoff
Assigned To: Christian Grothoff
Priority: high
Severity: feature
Reproducibility: N/A
Status: assigned
Resolution: open
Platform: i7
OS: Debian GNU/Linux
OS Version: squeeze
Product Version: git (master)
Target Version: 1.0 stretch goals
Summary: 0009053: add support for automated sanction list processing [2d] / 8.6.25
Description: We first need to get our hands on an actual sanction list so we know what the format looks like.
Tags: compliance

Relationships

parent of: 0010072 (confirmed): some inputs fail to yield score, depending on sanction list and input
Not all the children of this issue are yet resolved or closed.

Activities

Christian Grothoff

2025-01-05 16:24

manager   ~0023938

Last edited: 2025-01-12 09:17

Todo:
- actual helper(s) to evaluate sanctions list against attributes
- testing!
- improve threshold formulas

Christian Grothoff

2025-05-09 09:23

manager   ~0024860

Vint has delivered: https://git.disroot.org/lnrs/kycheck

Christian Grothoff

2025-05-29 15:56

manager   ~0025052

Last edited: 2025-06-05 11:07

I've compiled the code, several issues:

1) Downloaded the consolidated-list_2025-05-15.xml from https://www.sesam.search.admin.ch/sesam-search-web/pages/downloadXmlGesamtliste.xhtml?lang=en&action=downloadXmlGesamtlisteAction, then imported the 37 MB file via
~/.local/bin/kycheck --input ~/Downloads/consolidated-list_2025-05-15.xml
Result: kycheck takes 60s compute time on a Threadripper 1950X and consumes 1000 GB of virtual and 6 GB of actual RAM on-load. WTF? Note that xmllint parses the same XML in less than 1s on this system. There is also no conceivable reason to use significantly more RAM than the size of the list, so 64 MB would be fine, but 6 GB is out-of-this-world! Note that we intend to deploy GNU Taler on systems with less memory than this; you're using more than everything else combined! Plus it is awfully slow.

2) Next I tried to use it, and pasted '{"company_name" : "Eindhoven University of Technology", "id" : "abcdef", "address" : { "country" : "NL", "street_name" : "Groene Loper", "street_number" : "3", "zipcode" : "5612 AE", "town_location" : "Eindhoven" } }' from your example input into STDIN. The result was:
'"Could not decode JSON (\"Error in $: not enough input\"), please try again"'. Note the malformed error message, and the fact that I'm not getting what was promised either... I also tried just giving '{}' as the input (empty JSON object), same error. So it's definitely not a syntax error in the input.

3) I made one tiny modification to your build system, setting 'enable: false' in stack.yaml. After that, I could kind-of build on Debian stable just using 'stack build' (without NixOS installed). It still insisted on re-installing (!) the same version (!) of ghc, so I'm still not happy with the build system as the current state is not reasonable for creating Debian packages. Update: Figured out how to build a nice Debian package in 4c475a4..ec1fdaa -- did not need any significant changes to the build system, just command-line arguments to override. Nice!

Christian Grothoff

2025-05-29 16:15

manager   ~0025053

Last edited: 2025-06-05 11:08

4) The project has some insane dependencies. Like crypton, building Twofish and other cryptographic primitives. An HTTP client library. CBOR. ASN1. Socks5 support. iproute. blaze-html, zlib. In the end, you have a 53 MB binary (plus external C code) for 3000 lines of code! This is just wrong on many levels, impossible supply chain. I understand type-safety is easily confused with "safe dependency", but this is too much by far for what the project needs.

5) Adding "system-ghc: true" to stack.yaml seems to convince it to use the local compiler. ;-). Update: 4c475a4..ec1fdaa uses just command-line arguments to override to make the build work nicely on Debian.

Christian Grothoff

2025-05-29 16:16

manager   ~0025054

6) Compiler warning:
kycheck/app/Main.hs:76:85: warning: [GHC-18042] [-Wtype-defaults]
    • Defaulting the type variable ‘a0’ to type ‘Integer’ in the following constraints
        (Show a0) arising from a use of ‘show’ at app/Main.hs:76:85-88
        (Integral a0) arising from a use of ‘floor’ at app/Main.hs:76:92-96
    • In the first argument of ‘($)’, namely ‘show’
      In the second argument of ‘(++)’, namely
        ‘(show $ floor $ diffUTCTime start (UTCTime age 0))’
      In the second argument of ‘($)’, namely
        ‘"Seconds since epoch: "
           ++ (show $ floor $ diffUTCTime start (UTCTime age 0))’
   |
76 | Just age -> print $ "Seconds since epoch: " ++ (show $ floor $ diffUTCTime start (UTCTime age 0))
   |

Christian Grothoff

2025-06-03 11:49

manager   ~0025094

Notes on how to build the Debian package:

# apt install -t testing cdbs haskell-stack debhelper dhall haskell-devscripts-minimal ghc
$ dpkg-buildpackage -rfakeroot -b -uc -us

(with my debian/ folder). This worked on my office system, now locally I get "Failed to find C++ standard library", so probably some dependency is still missing in the list...

Christian Grothoff

2025-06-03 12:04

manager   ~0025095

Last edited: 2025-06-05 11:08

Installing g++-14 worked, not sure why g++-12 wasn't enough... Update: Made g++-14 a dependency for the Debian build.

Christian Grothoff

2025-06-03 12:16

manager   ~0025097

Last edited: 2025-06-05 11:52

(7) When giving --silent, it still prints '"Seconds since epoch: 26647950"'. Not sure why. This also goes to stdout, which is quite bad as it'd break the parser that expects to receive results in JSON. We should make sure to log at best to stderr and make sure stdout is strictly limited to the JSON result.

Update: logging changed to stderr, not sure what the output is supposed to tell us still. But harmless.

(8) It expects "quit" to be entered to, well, quit. Which I guess is OK, but it should also just quit on CTRL-D (end of stream). Which it does, except with an error message (even on 'silent'): 'kycheck: <stdin>: hGetLine: end of file'. I'm not sure we need the "quit" feature, and I think it would be nicer (given that the main interaction with this tool will not be by humans) if these outputs were removed (or at least left on only in --debug mode or so).

Update: now implemented to exit cleanly on CTRL-D (and always on invalid input).

Christian Grothoff

2025-06-03 12:19

manager   ~0025098

Last edited: 2025-06-05 11:51

(9) "Could not decode JSON (\"Unexpected end-of-input, expecting JSON value\"), please try again" is also logged to stdout instead of stderr. Here I just submitted "new line + CTRL-D". I'm OK with the code insisting on each line being a JSON input, but we should figure out how to do error handling nicely. The application will expect a JSON result per JSON input on stdout, so probably a good way to do it would be to log a human-readable error to stderr, and define some JSON format for the error on stdout, like: '{"status":"error", "code":42, "hint":"..."}'. We could then extend the normal output format to something like '{"status":"success", "match_quality":0.85357136, "confidence":0.9311688, "expiration":0, "reference":73508}'.

Update: logging fixed to go to stderr, plus CTRL-D is now implemented, plus output is no longer in JSON-ish but in the sscanf() format expected by taler-exchange-sanctionscheck.
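
For reference, the error convention proposed in (9), before the switch to the sscanf() format, could be sketched in C roughly as follows. This is only an illustration: format_error_json() and report_error() are made-up names, the escaping is deliberately minimal, and the helper itself is written in Haskell, so nothing here is taken from the actual code.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: write '{"status":"error","code":N,"hint":"..."}'
   into out.  Quotes and backslashes in hint are escaped, control
   characters are replaced by spaces.  Returns 0 on success, -1 if the
   hint or the output buffer is too small. */
int
format_error_json (char *out,
                   size_t out_len,
                   int code,
                   const char *hint)
{
  char esc[256];
  size_t j = 0;
  const size_t hint_len = strlen (hint);
  int n;

  for (size_t i = 0; i < hint_len; i++)
  {
    unsigned char c = (unsigned char) hint[i];

    if (j + 2 >= sizeof (esc))
      return -1; /* not enough room for escape + character + NUL */
    if ( ('"' == c) || ('\\' == c) )
      esc[j++] = '\\';
    esc[j++] = (c < 0x20) ? ' ' : (char) c;
  }
  esc[j] = '\0';
  n = snprintf (out,
                out_len,
                "{\"status\":\"error\",\"code\":%d,\"hint\":\"%s\"}",
                code,
                esc);
  if ( (n < 0) || ((size_t) n >= out_len) )
    return -1;
  return 0;
}

/* Hypothetical helper: human-readable message to stderr, exactly one
   machine-readable JSON line to stdout. */
void
report_error (int code,
              const char *hint)
{
  char line[512];

  fprintf (stderr, "error %d: %s\n", code, hint);
  if (0 == format_error_json (line, sizeof (line), code, hint))
    puts (line);
  fflush (stdout);
}
```

The point of the split is that the parent process can keep reading one result per input line from stdout without ever seeing log text.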

Christian Grothoff

2025-06-03 12:25

manager   ~0025099

Last edited: 2025-06-05 11:50

(10) I now got something close to the expected output for 'normal' person:
>>
$ /tmp/bin/kycheck --silent --input files/consolidated-list_2024-07-30.xml 2> /dev/null
"Seconds since epoch: 26648230"
{"full_name" : "Maria Consuela", "last_name" : "", "address" : { "country" : "GT", "street_name" : "Unknown", "street_number" : "", "zipcode" : "" }, "birthdate" : "1953-06-23", "nationality" : "GT", "national_id" : "" }
Score {match_quality = 0.85357136, confidence = 0.9311688, expiration = 0, reference = 73508}
<<
What is not great is the "Score" prefix; that just makes parsing harder. Furthermore, the syntax is only *almost* JSON. We should use
'{"status":"success","match_quality": 0.85357136, "confidence": 0.9311688, "expiration": 0, "reference": 73508}' so that the main process can run this through a regular JSON parser and not something custom.
UPDATE: I've checked, and the C code actually used

   sscanf (buf,
           "%lf %lf %llu %1023s",
           &rating,
           &confidence,
           &expire,
           best_match)
 
 for parsing the robocop output. Git ec1fdaa..53ec32c modified the robocop output to match that requirement.
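
A minimal, self-contained sketch of how a line in that format can be parsed and validated. Only the format string is taken from the note above; struct RobocopResult and parse_robocop_line() are illustrative names, not the actual taler-exchange-sanctionscheck code.

```c
#include <stdio.h>

/* Hypothetical container for one result line. */
struct RobocopResult
{
  double rating;
  double confidence;
  unsigned long long expire;
  char best_match[1024]; /* %1023s stores at most 1023 bytes plus NUL */
};

/* Parse one output line of the form
   "<rating> <confidence> <expire> <best_match>".
   Returns 0 on success, -1 if fewer than four fields were converted. */
int
parse_robocop_line (const char *buf,
                    struct RobocopResult *res)
{
  if (4 != sscanf (buf,
                   "%lf %lf %llu %1023s",
                   &res->rating,
                   &res->confidence,
                   &res->expire,
                   res->best_match))
    return -1;
  return 0;
}
```

Checking the return value against 4 is what rejects lines such as the old 'Score {...}' output, where not even the first number converts.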

(11) I also wonder what the "expiration" of 0 means here. "never"? Already expired? What's the unit?

Update: Unit is in seconds, changed C code to interpret 0 as "forever", which is I guess the conservative solution. We should still figure out if 0 expiration is a bug or simply that the sanction list fails to say.
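
The "0 means forever" rule could look roughly like this. This is a sketch under assumptions: EXPIRE_FOREVER and normalize_expiration() are made-up names, and the real C code may represent "forever" differently.

```c
#include <stdint.h>

/* Hypothetical sentinel for "never expires". */
#define EXPIRE_FOREVER UINT64_MAX

/* Map the expiration (in seconds) reported by the helper to the value
   used internally; 0 is conservatively treated as "forever". */
uint64_t
normalize_expiration (unsigned long long expire_s)
{
  if (0 == expire_s)
    return EXPIRE_FOREVER;
  return (uint64_t) expire_s;
}
```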

Christian Grothoff

2025-06-03 12:27

manager   ~0025100

Last edited: 2025-06-03 12:27

(2) Is now better for normal users (see above), but I still cannot get the business match to work, I tried:

$ /tmp/bin/kycheck --silent --input files/consolidated-list_2024-07-30.xml 2> /dev/null
{"company_name" : "Eindhoven University of Technology", "id" : "abcdef", "address" : { "country" : "NL", "street_name" : "Groene Loper", "street_number" : "3", "zipcode" : "5612 AE", "town_location" : "Eindhoven" } }
"Seconds since epoch: 26648725"

and there is simply no output at all (no error, nothing). The result is the same even with stderr / without --silent.

Christian Grothoff

2025-06-03 12:35

manager   ~0025101

(1) was explained by Michiel to be the pre-processing / building FSMs for fast matching later, which makes sense (especially now that I can see the runtime even with the pre-processing). I thought the code was just parsing the XML, so that explains the memory consumption and the loading time. On the TOPS production system, I checked and we have more than enough RAM, and GLS hasn't yet indicated that they will want us to do sanction enforcement, so for now the RAM usage is no real concern. I'm still uncomfortable with the dependency chain (but the current binary is MUCH smaller already, 22 MB, massive reduction!). But I think this can easily be fixed by:

(a) splitting the cool (and in principle generic!) matching logic from the (complex) XML parsing: convert Swiss XML to some "internal" JSON format first -- which can be done away from the production system and we will care much less about the dependency chain here -- and then
(b) load our internal JSON into the production matching logic which ONLY does the approximate matching and then only needs JSON inputs (stdin, --input) and produces JSON outputs and thus doesn't *need* an XML parser as a dependency anymore.

This would also have the advantage that if we in the future get sanction lists in other countries in other formats (CSV, JSON, different XML, who knows!) we can just write a converter to the "internal" JSON format (which should be documented...) and can keep the interesting core logic unchanged.

Christian Grothoff

2025-06-03 12:48

manager   ~0025102

Last edited: 2025-06-05 11:09

(12) I think we should (re)consider the binary name. "kycheck" is a bit short and IMO not memorable enough. Given that it is about automated enforcement of sanctions, "robocop" comes to mind (or, if you prefer, "murphy", which via Murphy's law also gives the idea that things could go wrong / that the matching is statistical). robocop is not yet taken, which is also good; there is a libmurphy in Debian, which is a minor concern for that name. If/when we do the split, we also need a 2nd name for the XML-to-JSON converter, but there I'd go for something purely functional, such as "robocop-helper-ch-xml-converter". WDYT?

Update: in absence of a response, I've changed everything to "robocop".

Christian Grothoff

2025-06-03 12:52

manager   ~0025103

Last edited: 2025-06-05 11:09

Starting point for Debian package (to be renamed) attached. We should also move the git to git.taler.net, will need Vint's ssh public key for that. Update: Found Vint's key, robocop.git created.
deb.tar (51,200 bytes)

Christian Grothoff

2025-06-03 16:22

manager   ~0025107

Last edited: 2025-06-03 16:23

Worked on taler-exchange-sanctionscheck today, added:
- options for configurable thresholds
- incremental processing, storing last seen row ID
- background mode waiting for notify
- reset command-line option to work on new sanction list

Todo:
- DB notify is missing on kyc attribute insertion
- testing of taler-exchange-sanctionscheck (manual, automated, etc.)
- integration test with Haskell tool
- ansible integration

Christian Grothoff

2025-06-05 12:14

manager   ~0025146

There seems to be only one issue (0010072) left before Robocop is usable. Main Todos here:
- integration test
- ansible integration

Issue History

Date Modified Username Field Change
2024-08-09 20:29 Christian Grothoff New Issue
2024-08-09 20:29 Christian Grothoff Status new => assigned
2024-08-09 20:29 Christian Grothoff Assigned To => Christian Grothoff
2024-08-19 09:01 Christian Grothoff Target Version 0.14 => 1.0
2024-08-23 00:24 Christian Grothoff Target Version 1.0 => 1.0 stretch goals
2024-08-24 10:40 Christian Grothoff Summary add support for sanction lists => add support for sanction lists [5d]
2024-09-14 00:57 Christian Grothoff Priority urgent => high
2025-01-05 16:24 Christian Grothoff Note Added: 0023938
2025-01-05 16:24 Christian Grothoff Summary add support for sanction lists [5d] => add support for sanction lists [4d]
2025-01-05 23:22 Christian Grothoff Note Edited: 0023938
2025-01-12 09:17 Christian Grothoff Note Edited: 0023938
2025-04-17 22:21 Christian Grothoff Tag Attached: compliance
2025-05-07 16:51 Florian Dold Summary add support for sanction lists [4d] => add support for automated sanction list processing [4d]
2025-05-09 09:23 Christian Grothoff Note Added: 0024860
2025-05-29 15:56 Christian Grothoff Note Added: 0025052
2025-05-29 16:15 Christian Grothoff Note Added: 0025053
2025-05-29 16:16 Christian Grothoff Note Added: 0025054
2025-06-03 01:31 Christian Grothoff Summary add support for automated sanction list processing [4d] => add support for automated sanction list processing [4d] / 8.6.25
2025-06-03 11:49 Christian Grothoff Note Added: 0025094
2025-06-03 12:04 Christian Grothoff Note Added: 0025095
2025-06-03 12:16 Christian Grothoff Note Added: 0025097
2025-06-03 12:19 Christian Grothoff Note Added: 0025098
2025-06-03 12:25 Christian Grothoff Note Added: 0025099
2025-06-03 12:27 Christian Grothoff Note Added: 0025100
2025-06-03 12:27 Christian Grothoff Note Edited: 0025100
2025-06-03 12:35 Christian Grothoff Note Added: 0025101
2025-06-03 12:48 Christian Grothoff Note Added: 0025102
2025-06-03 12:52 Christian Grothoff Note Added: 0025103
2025-06-03 12:52 Christian Grothoff File Added: deb.tar
2025-06-03 16:20 Christian Grothoff Summary add support for automated sanction list processing [4d] / 8.6.25 => add support for automated sanction list processing [3d] / 8.6.25
2025-06-03 16:22 Christian Grothoff Note Added: 0025107
2025-06-03 16:23 Christian Grothoff Note Edited: 0025107
2025-06-05 11:07 Christian Grothoff Note Edited: 0025052
2025-06-05 11:08 Christian Grothoff Note Edited: 0025053
2025-06-05 11:08 Christian Grothoff Note Edited: 0025095
2025-06-05 11:09 Christian Grothoff Note Edited: 0025102
2025-06-05 11:09 Christian Grothoff Note Edited: 0025103
2025-06-05 11:50 Christian Grothoff Note Edited: 0025099
2025-06-05 11:51 Christian Grothoff Note Edited: 0025098
2025-06-05 11:52 Christian Grothoff Note Edited: 0025097
2025-06-05 12:03 Christian Grothoff Relationship added parent of 0010072
2025-06-05 12:14 Christian Grothoff Note Added: 0025146
2025-06-05 12:14 Christian Grothoff Summary add support for automated sanction list processing [3d] / 8.6.25 => add support for automated sanction list processing [2d] / 8.6.25