View Issue Details

ID: 0008620
Project: libextractor
Category: libextractor main library
View Status: public
Last Update: 2024-04-17 20:26
Reporter: apteryx
Assigned To: Christian Grothoff
Priority: normal
Severity: minor
Reproducibility: always
Status: resolved
Resolution: fixed
Product Version: 1.13
Target Version: 1.14
Fixed in Version: 1.14
Summary: 0008620: libextractor searches tidy-html include as <tidy/tidy.h>; is packaged simply as <tidy.h>
Description: The tidy-html detection in configure.ac is flawed; it only looks for a header named <tidy/tidy.h>. The headers of a recent version of tidy-html (5.8.0) get installed by its CMake build system like so:

$ find /gnu/store/pbabjp0f5z86dhb064hg2abqkw6wx2r9-tidy-html-5.8.0/include
/gnu/store/pbabjp0f5z86dhb064hg2abqkw6wx2r9-tidy-html-5.8.0/include
/gnu/store/pbabjp0f5z86dhb064hg2abqkw6wx2r9-tidy-html-5.8.0/include/tidybuffio.h
/gnu/store/pbabjp0f5z86dhb064hg2abqkw6wx2r9-tidy-html-5.8.0/include/tidy.h
/gnu/store/pbabjp0f5z86dhb064hg2abqkw6wx2r9-tidy-html-5.8.0/include/tidyenum.h
/gnu/store/pbabjp0f5z86dhb064hg2abqkw6wx2r9-tidy-html-5.8.0/include/tidyplatform.h

Thus the check and the usage should simply use <tidy.h> rather than <tidy/tidy.h>. See the tidy-html build system at https://github.com/htacg/tidy-html5/blob/d08ddc2860aa95ba8e301343a30837f157977cba/CMakeLists.txt#L361 to confirm that the headers are indeed installed directly under 'include/', not under a 'tidy/' subdirectory.

tidy/tidy.h should still be considered for older versions.
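
To illustrate what "considering both" could look like on the C side, here is a minimal sketch that picks whichever header layout is present at compile time. It is not what libextractor does: it assumes a compiler that provides __has_include (GCC 5+, Clang, or C23), and the names EXAMPLE_TIDY_LAYOUT and example_tidy_layout_name are made up for this example. An autoconf-based project would more likely probe both names in configure.ac and test the resulting HAVE_* macros instead.

```c
/* Sketch only: choose between the two tidy header layouts at compile time.
 * Assumes the compiler provides __has_include (GCC 5+, Clang, C23).
 * EXAMPLE_TIDY_LAYOUT and example_tidy_layout_name are hypothetical names. */
#if defined(__has_include)
#  if __has_include(<tidy.h>) && __has_include(<tidybuffio.h>)
     /* Flat layout, as installed by the tidy-html 5.8.0 CMake build. */
#    include <tidy.h>
#    include <tidybuffio.h>
#    define EXAMPLE_TIDY_LAYOUT 1
#  elif __has_include(<tidy/tidy.h>) && __has_include(<tidy/tidybuffio.h>)
     /* Layout with a tidy/ subdirectory, as the current code expects. */
#    include <tidy/tidy.h>
#    include <tidy/tidybuffio.h>
#    define EXAMPLE_TIDY_LAYOUT 2
#  endif
#endif

#ifndef EXAMPLE_TIDY_LAYOUT
#  define EXAMPLE_TIDY_LAYOUT 0 /* no usable tidy headers were found */
#endif

/* Report which header layout this translation unit was compiled against. */
const char *example_tidy_layout_name(void)
{
  switch (EXAMPLE_TIDY_LAYOUT)
  {
  case 1:
    return "flat";
  case 2:
    return "subdir";
  default:
    return "none";
  }
}
```

Checking for tidybuffio.h alongside tidy.h keeps the two includes consistent, so a partial installation falls through to the next branch rather than failing halfway.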
Tags: No tags attached.

Activities

apteryx

2024-03-13 03:40

reporter   ~0021882

Patch sent to bug-libextractor@gnu.org, with Message-ID 20240313023849.16390-1 ... ("html_extractor: Add support for modern tidy-html.")

Christian Grothoff

2024-04-10 23:57

manager   ~0022201

Hmm. I didn't get the patch. Maybe the list / alias is broken? Care to attach it here?

apteryx

2024-04-17 20:05

reporter   ~0022268

Oh, not sure what went wrong. Perhaps my email is stuck in the moderation queue or similar?
0001-html_extractor-Add-support-for-modern-tidy-html.patch (2,414 bytes)   
From 1fc6daaeaf829fb941a176831c011888a73c43b9 Mon Sep 17 00:00:00 2001
From: Maxim Cournoyer <maxim.cournoyer@gmail.com>
Date: Mon, 11 Mar 2024 09:36:26 -0400
Subject: [PATCH] html_extractor: Add support for modern tidy-html.

* configure.ac: Use PKG_PROG_PKG_CONFIG to initialize pkg-config detection.
<tidy>: Check for library via pkg-config.
* src/plugins/html_extractor.c: Standardize tidy include file names.
---
 configure.ac                 | 28 +++++++++-------------------
 src/plugins/html_extractor.c |  4 ++--
 2 files changed, 11 insertions(+), 21 deletions(-)

diff --git a/configure.ac b/configure.ac
index d17ff39..e89d70c 100644
--- a/configure.ac
+++ b/configure.ac
@@ -176,6 +176,8 @@ AS_CASE(["$target_os"],
 
 AM_ICONV
 
+PKG_PROG_PKG_CONFIG()
+
 # We define the paths here, because MinGW/GCC expands paths
 # passed through the command line ("-DLOCALEDIR=..."). This would
 # lead to hard-coded paths ("C:\mingw\mingw\bin...") that do
@@ -424,25 +426,13 @@ AC_CHECK_LIB(magic, magic_open,
    AM_CONDITIONAL(HAVE_MAGIC, false))],
   AM_CONDITIONAL(HAVE_MAGIC, false))
 
-AC_MSG_CHECKING(for tidyNodeGetValue -ltidy)
-AC_LANG_PUSH(C++)
-SAVED_LIBS=$LIBS
-LIBS="$LIBS -ltidy"
-AC_LINK_IFELSE(
-  [AC_LANG_PROGRAM([[#include <tidy/tidy.h>]],
-    [[ Bool b = tidyNodeGetValue (NULL, NULL, NULL); ]])],
-  [AC_MSG_RESULT(yes)
-   AM_CONDITIONAL(HAVE_TIDY, true)
-   AC_DEFINE(HAVE_TIDY,1,[Have tidyNodeGetValue in libtidy])],
-  [AC_MSG_RESULT(no)
-   AM_CONDITIONAL(HAVE_TIDY, false)])
-LIBS=$SAVED_LIBS
-AC_LANG_POP(C++)
-
-# restore LIBS
-LIBS=$LIBSOLD
-
-
+dnl tidyNodeGetValue was already available in 5.0.0, released in 2015.
+PKG_CHECK_MODULES([TIDY], [tidy >= 5.0.0],
+ [AC_DEFINE(HAVE_TIDY, 1, [Have tidy])
+  AM_CONDITIONAL(HAVE_TIDY, true)],
+ [AM_CONDITIONAL(HAVE_TIDY, false)])
+CFLAGS="$CFLAGS $TIDY_CFLAGS"
+LIBS="$LIBS $TIDY_LIBS"
 
 # should 'make check' run tests?
 AC_MSG_CHECKING(whether to run tests)
diff --git a/src/plugins/html_extractor.c b/src/plugins/html_extractor.c
index 5ebf97b..88100d3 100644
--- a/src/plugins/html_extractor.c
+++ b/src/plugins/html_extractor.c
@@ -26,8 +26,8 @@
 #include "platform.h"
 #include "extractor.h"
 #include <magic.h>
-#include <tidy/tidy.h>
-#include <tidy/tidybuffio.h>
+#include <tidy.h>
+#include <tidybuffio.h>
 
 /**
  * Mapping of HTML META names to LE types.

base-commit: a75f40b64b5868967c95ea214e8eaac4f7088b23
-- 
2.41.0

Christian Grothoff

2024-04-17 20:26

manager   ~0022270

I had to patch around some more to keep it *also* working on Debian. Result is in a75f40b..d68210a, should now work for *both* types of installations.

Issue History

Date Modified | Username | Field | Change
2024-03-10 03:45 | apteryx | New Issue |
2024-03-13 03:40 | apteryx | Note Added: 0021882 |
2024-04-10 23:57 | Christian Grothoff | Note Added: 0022201 |
2024-04-17 20:05 | apteryx | Note Added: 0022268 |
2024-04-17 20:05 | apteryx | File Added: 0001-html_extractor-Add-support-for-modern-tidy-html.patch |
2024-04-17 20:26 | Christian Grothoff | Note Added: 0022270 |
2024-04-17 20:26 | Christian Grothoff | Assigned To | => Christian Grothoff
2024-04-17 20:26 | Christian Grothoff | Status | new => resolved
2024-04-17 20:26 | Christian Grothoff | Resolution | open => fixed
2024-04-17 20:26 | Christian Grothoff | Fixed in Version | => 1.14
2024-04-17 20:26 | Christian Grothoff | Target Version | => 1.14