
JHove extracts nonsense CJK characters instead of title, other metadata from some PDFs #1032

@dmolesUC


Steps to reproduce:

  1. Download PDFs such as the attached Goldberg_gwu_0075M_10226.pdf, or any PDF you find by searching the Internet for the string きどっひはびはてぴ. You'll find a large number of PDF documents in institutional repositories that use FITS/JHOVE for characterization, probably mostly via Hyrax/Samvera.

  2. Run jhove on them.

Expected behavior:

  • JHOVE reports the correct title, in this case: Microsoft Word - Speech-Language Pathologists’ Self-Reported Definition of Cluttering and Confidence in Assessment of Cluttering

Actual behavior:

  • JHOVE reports the title as a CJK nonsense string beginning with きどっひはびはてぴ〠しはひつ〠〭〠

(Note that exiftool reports the title correctly.)
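The nonsense prefix is not random: every character in it has a code point of the form 0x3000 plus an ASCII value, so subtracting 0x3000 recovers the start of the expected title. The sketch below illustrates this on the sample string from this report. The helper name `unshift` and its `offset` parameter are hypothetical, introduced only for illustration. This is an observation about the string itself; it does not establish where in JHOVE the misreading happens.

```python
# The garbled prefix reported above, copied verbatim.
GARBLED = "きどっひはびはてぴ〠しはひつ〠〭〠"


def unshift(text: str, offset: int = 0x3000) -> str:
    """Map code points in [offset, offset + 0x7F] back to ASCII.

    Hypothetical helper for illustration only; characters outside
    that range are left unchanged.
    """
    return "".join(
        chr(ord(ch) - offset) if offset <= ord(ch) <= offset + 0x7F else ch
        for ch in text
    )


# Subtracting 0x3000 from each code point yields the expected prefix.
print(repr(unshift(GARBLED)))  # 'Microsoft Word - '

# Equivalent byte-level view: in UTF-16BE each character is the byte
# 0x30 (ASCII "0") followed by the byte of the expected ASCII letter.
print(GARBLED.encode("utf-16-be"))  # b'0M0i0c0r0o0s0o0f0t0 0W0o0r0d0 0-0 '
```

Seen this way, the reported string looks like the correct ASCII text with an extra 0x30 byte paired with each character and the pairs then read as 16-bit code units. That interpretation is an assumption and has not been checked against the raw bytes of the title in the PDF.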

Notes:

Similar behavior seems to happen with PDFs produced in different ways and with different issues. For instance, 4912.pdf.pdf clearly has some other things going on: macOS Preview gives the title as Microsoft Word - Dissertation_Master_6_12-21-11.doc, while exiftool (possibly reading RDF from the embedded <x:xmpmeta>?) gives the title, creator, and producer as (different) strings of escape codes. JHOVE, however, gives the title as a (different) CJK nonsense string with the same initial きどっひはびはてぴ〠しはひつ〠〭〠, and also gives the author and producer as CJK nonsense strings, while macOS Preview and exiftool both give the author as Sam.

Tested with JHOVE 1.32.1 on macOS, as well as JHOVE 1.26.1 via the FITSservlet Docker image.

Labels: P2 (Medium priority issues to be scheduled in a future release)
