Description
Steps to reproduce:
- Download PDFs such as the attached Goldberg_gwu_0075M_10226.pdf (or any PDF you find by searching the Internet for the string きどっひはびはてぴ; you'll find a large number of PDF documents in institutional repositories that use FITS/Jhove for characterization, probably mostly via Hyrax/Samvera).
- Run jhove on them.
Expected behavior:
- Jhove reports the correct title, in this case: Microsoft Word - Speech-Language Pathologists’ Self-Reported Definition of Cluttering and Confidence in Assessment of Cluttering
Actual behavior:
- Jhove reports the title as a CJK nonsense string beginning with: きどっひはびはてぴ〠しはひつ〠〭〠
(Note that exiftool reports the title correctly.)
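One observation that may help narrow this down, offered as a hint rather than a diagnosis: each character of the nonsense prefix appears to be the corresponding character of the expected title with 0x3000 added to its code point, as if each character had been decoded as a UTF-16BE code unit with a high byte of 0x30 instead of 0x00. Why that would happen is not established here. The Python sketch below checks the pattern against the prefix quoted above; `shift_down` is a hypothetical helper name used only for illustration and is not part of Jhove or exiftool.

```python
# Observation only: this does not reproduce Jhove's parsing, it just tests
# whether the garbled prefix is the expected text shifted by 0x3000.
GARBLED_PREFIX = "きどっひはびはてぴ〠しはひつ〠〭〠"  # prefix reported by Jhove


def shift_down(text: str, offset: int = 0x3000) -> str:
    """Subtract a fixed offset from every code point (hypothetical helper)."""
    return "".join(chr(ord(ch) - offset) for ch in text)


recovered = shift_down(GARBLED_PREFIX)
print(repr(recovered))  # prints 'Microsoft Word - '
```

The recovered text matches the start of the expected title, including the spaces and the hyphen, so the mapping holds for every character in the quoted prefix.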
Notes:
Similar behavior seems to occur with PDFs produced in different ways and with different issues. For instance, 4912.pdf.pdf clearly has some other things going on: macOS Preview gives the title as Microsoft Word - Dissertation_Master_6_12-21-11.doc, while exiftool (possibly reading RDF from the embedded <x:xmpmeta>?) gives title, creator, and producer as (different) strings of escape codes. Jhove, however, gives the title as a (different) CJK nonsense string with the same initial きどっひはびはてぴ〠しはひつ〠〭〠, and also gives author and producer as CJK nonsense strings, while macOS Preview and exiftool both give the author as Sam.
Tested with jhove 1.32.1 on macOS, as well as 1.26.1 via the FITSservlet Docker image.