suggestion for reading subcrates #244

LauLauThom · 2025-12-02T15:53:47Z

Hi guys,

Live from the deNBI hackathon here: I have been playing with reading an entity that references a subcrate, i.e. traversing the graph from the top crate to a subcrate, as suggested in the 1.2 spec.

I am proposing a simple approach here, with a new Subcrate class extending the Dataset class.
I defined this class in the main rocrate.py file, since defining it in models.py would cause circular dependencies.

This would allow things like

main_crate = ROCrate(test_data_dir / "crate_with_subcrate")
subcrate = main_crate.get("subcrate")
subfile = subcrate.get("subfile.txt")
# or 
subfile = subcrate["hasPart"][0]

(see added tests too)

At this point I am mostly interested in knowing whether you think this could be a viable approach before going further.

The implementation is such that the subcrate is only loaded when one of its attributes is accessed, to avoid potentially loading a large amount of metadata, since one purpose of a subcrate is also to reduce the amount of information in the main crate.
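
To illustrate the lazy-loading idea described above, here is a minimal sketch. It is not the code from this PR: the class name LazySubcrate, the load_count counter and the graph property are hypothetical, and the only assumption about the on-disk layout is that the subcrate directory contains an ro-crate-metadata.json file with an "@graph" list.

```python
import json
from pathlib import Path


class LazySubcrate:
    """Hypothetical stand-in for the proposed Subcrate class (not the PR's code)."""

    def __init__(self, directory):
        self.directory = Path(directory)
        self._graph = None   # nothing is read from disk until first access
        self.load_count = 0  # only here to make the laziness observable

    @property
    def graph(self):
        # Parse the subcrate's metadata file on first access, then cache it.
        if self._graph is None:
            metadata_path = self.directory / "ro-crate-metadata.json"
            with open(metadata_path, encoding="utf-8") as f:
                self._graph = json.load(f)["@graph"]
            self.load_count += 1
        return self._graph

    def get(self, entity_id):
        # Looking up an entity by "@id" is what triggers the load.
        for entity in self.graph:
            if entity.get("@id") == entity_id:
                return entity
        return None
```

Creating a LazySubcrate costs nothing; the metadata file is read once, on the first get() call, and reused afterwards.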

elichad · 2025-12-03T09:21:49Z

I had a quick look and I quite like this implementation at first glance.

Some extra suggestions:

Log a message when a subcrate is parsed, to make clear where metadata is being retrieved from.
It could be useful to have an ROCrate.subcrates property, analogous to ROCrate.data_entities: a list of subcrate entities that is built while parsing the main crate. That list can then be used for iteration even if you don't know the subcrate ids (you could even do this recursively if you were willing to load all the crates into memory).
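
The recursive iteration mentioned in the second suggestion could look roughly like the sketch below. It works directly on metadata files rather than on ro-crate-py objects, and it rests on two assumptions: that a subcrate is a local directory with its own ro-crate-metadata.json, and that it is marked in the parent crate by a conformsTo reference to the generic RO-Crate profile. The function name iter_subcrate_dirs is hypothetical.

```python
import json
from pathlib import Path

# Assumption: a subcrate entity points at the generic RO-Crate profile.
ROCRATE_PROFILE = "https://w3id.org/ro/crate"


def iter_subcrate_dirs(crate_dir):
    """Yield the directory of every nested subcrate, depth first."""
    crate_dir = Path(crate_dir)
    with open(crate_dir / "ro-crate-metadata.json", encoding="utf-8") as f:
        graph = json.load(f)["@graph"]
    for entity in graph:
        conforms = entity.get("conformsTo", {})
        if not isinstance(conforms, dict):
            continue  # this sketch ignores list-valued conformsTo
        profile = conforms.get("@id", "").rstrip("/")
        if entity.get("@id") != "./" and profile == ROCRATE_PROFILE:
            sub_dir = crate_dir / entity["@id"]
            yield sub_dir
            # Recursing loads each nested metadata file in turn.
            yield from iter_subcrate_dirs(sub_dir)
```

Because it is a generator, a caller can stop early and avoid reading the metadata of deeper levels.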

LauLauThom · 2025-12-03T14:30:44Z

Thanks for the quick feedback!

I have pushed a few more commits to allow things like subfile = main_crate.get("subcrate/subfile.txt") and also implemented the ROCrate.subcrates property as suggested.
I also tested with another nested level, i.e. another subcrate in the subcrate 😅, and that works too.

For the logging, it seems it's not used in the codebase yet. Should I just use the logging package, initializing a default logger instance like logger = logging.getLogger(__name__)?
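
For reference, the usual standard-library pattern for the question above is a module-level logger that the library never configures itself. In this sketch the function load_subcrate and its return value are hypothetical placeholders; only the logging calls are the point.

```python
import logging

# One logger per module; the library does not attach handlers or set levels.
logger = logging.getLogger(__name__)


def load_subcrate(path):
    """Hypothetical helper: stands in for whatever actually parses a subcrate."""
    # Lazy %-style formatting: the message is only built if it is emitted.
    logger.info("Parsing subcrate metadata from %s", path)
    return {"source": str(path)}
```

With this layout the message stays silent by default, and an application that wants to see it can opt in, for example with logging.basicConfig(level=logging.INFO).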

Now I am wondering if a Subcrate should be a valid RO-Crate object too.
I think it should not be a problem, only a bit verbose to implement all required methods (see Subcrate.get_entities() for an example)

Happy to discuss this in the drop-in call tomorrow 😉

EDIT: I also committed a .pre-commit-config.yaml file to help enforce flake8 style; it could be removed before merging, of course.

simleo · 2025-12-04T09:49:01Z

I went through the code and did some testing, which exposed problems:

With parse_subcrate left to False all seems well:

>>> from rocrate.rocrate import ROCrate
>>> crate = ROCrate("test/test-data/crate_with_subcrate")
>>> d = crate.get("subcrate/")
>>> d
<subcrate/ Dataset>
>>> d.get("conformsTo")
'https://w3id.org/ro/crate/'
>>> crate.write("/tmp/crate")

With parse_subcrate set to True:

>>> from rocrate.rocrate import ROCrate
>>> crate = ROCrate("test/test-data/crate_with_subcrate", parse_subcrate=True)
>>> d = crate.get("subcrate/")
>>> d
<subcrate/ Dataset>
>>> d.get("conformsTo")  # this fails
>>> crate.write("/tmp/crate")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/simleo/git/ro-crate-py/rocrate/rocrate.py", line 567, in write
    writable_entity.write(base_path)
  File "/home/simleo/git/ro-crate-py/rocrate/model/metadata.py", line 97, in write
    super()._write_from_stream(write_path)
  File "/home/simleo/git/ro-crate-py/rocrate/model/file.py", line 63, in _write_from_stream
    for _, chunk in self.stream():
  File "/home/simleo/git/ro-crate-py/rocrate/model/metadata.py", line 90, in stream
    yield self.id, str.encode(json.dumps(content, indent=4, sort_keys=True), encoding='utf-8')
                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/json/__init__.py", line 238, in dumps
    **kw).encode(obj)
          ^^^^^^^^^^^
  File "/usr/lib/python3.12/json/encoder.py", line 202, in encode
    chunks = list(chunks)
             ^^^^^^^^^^^^
  File "/usr/lib/python3.12/json/encoder.py", line 432, in _iterencode
    yield from _iterencode_dict(o, _current_indent_level)
  File "/usr/lib/python3.12/json/encoder.py", line 406, in _iterencode_dict
    yield from chunks
  File "/usr/lib/python3.12/json/encoder.py", line 326, in _iterencode_list
    yield from chunks
  File "/usr/lib/python3.12/json/encoder.py", line 406, in _iterencode_dict
    yield from chunks
  File "/usr/lib/python3.12/json/encoder.py", line 326, in _iterencode_list
    yield from chunks
  File "/usr/lib/python3.12/json/encoder.py", line 439, in _iterencode
    o = _default(o)
        ^^^^^^^^^^^
  File "/usr/lib/python3.12/json/encoder.py", line 180, in default
    raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type File is not JSON serializable

I think this has something to do with the hacking of ._jsonld in the Subcrate class. More generally, I'm concerned about the mixing of interfaces (some Dataset and some ROCrate behavior) that happens in the Subcrate class. The role of Subcrate is not clear-cut; for instance, it is odd that a Subcrate object has a subcrate attribute:

>>> from rocrate.rocrate import ROCrate
>>> crate = ROCrate("test/test-data/crate_with_subcrate", parse_subcrate=True)
>>> subcrate = crate.get("subcrate/")
>>> subcrate.subcrate
>>> subcrate.get("conformsTo")
>>> subcrate.subcrate
<rocrate.rocrate.ROCrate object at 0x7b6edf975160>
>>> subcrate.subcrate.subcrate_entities
[<subsubcrate/ Dataset>]
>>> subcrate.get("conformsTo")
>>>

The new structure looks very confusing. I did not have the time to work on it yet, but I have the feeling that subcrate support could / should be done without using a special Subcrate class: a subcrate should be of type ROCrate, like the main one, while the subcrate attribute might be added directly to the Dataset class.

The main thing I would like to point out about the current solution is that it hacks critical sections of the code, so getting it right is harder than it looks.

LauLauThom · 2025-12-04T16:50:57Z

Thanks for the feedback, Simone. I added a couple of changes.

First, I made sure any attribute on the original dataset entity (such as conformsTo) is preserved.
I also renamed the Subcrate.subcrate attribute to Subcrate._crate to avoid confusion, and added a get_crate getter for this attribute.

I also changed the behaviour slightly, so that the Subcrate entity behaves more like a Dataset, for instance by removing Subcrate.get_entities.

LauLauThom · 2025-12-10T09:49:03Z

I added a couple more tests to cover the writing of the crate.
Happy to discuss it next week at the EU drop-in call ;)

simleo · 2025-12-11T15:58:28Z

6ea62fe avoids modification of the main crate Dataset's metadata when loading a subcrate. Before this change, when reading and then writing an RO-Crate with loaded subcrates, the new top-level metadata file was:

{
    "@context": "https://w3id.org/ro/crate/1.1/context",
    "@graph": [
        {
            "@id": "./",
            "@type": "Dataset",
            "datePublished": "2025-12-02T08:39:54+00:00",
            "description": "A RO-Crate containing a subcrate",
            "hasPart": [
                {
                    "@id": "file.txt"
                },
                {
                    "@id": "subcrate/"
                }
            ],
            "license": "https://spdx.org/licenses/MIT.html",
            "name": "Top-level crate with subcrate"
        },
        {
            "@id": "ro-crate-metadata.json",
            "@type": "CreativeWork",
            "about": {
                "@id": "./"
            },
            "conformsTo": {
                "@id": "https://w3id.org/ro/crate/1.1"
            }
        },
        {
            "@id": "file.txt",
            "@type": "File"
        },
        {
            "@id": "subcrate/",
            "@type": "Dataset",
            "conformsTo": {
                "@id": "https://w3id.org/ro/crate/"
            },
            "hasPart": [
                {
                    "@id": "subfile.txt"
                },
                {
                    "@id": "subsubcrate/"
                }
            ]
        }
    ]
}

which has several problems:

"subfile.txt" and "subsubcrate/" are not valid ids in the main crate: the correct ones would be "subcrate/subfile.txt" and "subcrate/subsubcrate/", respectively.
There are no entities corresponding to those ids.
The output metadata of a copied crate differs from the input metadata.
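
The second problem can be checked mechanically. The following is a small sketch, independent of ro-crate-py, that lists hasPart references with no matching entity in "@graph"; the function name dangling_references is hypothetical and the input is an abbreviated version of the metadata shown above.

```python
def dangling_references(metadata):
    """Return (referrer id, referenced id) pairs whose target is not in @graph."""
    graph = metadata["@graph"]
    known_ids = {entity["@id"] for entity in graph}
    missing = []
    for entity in graph:
        parts = entity.get("hasPart", [])
        if isinstance(parts, dict):  # a single reference may not be wrapped in a list
            parts = [parts]
        for part in parts:
            if part["@id"] not in known_ids:
                missing.append((entity["@id"], part["@id"]))
    return missing


# Abbreviated version of the rewritten top-level metadata shown above.
metadata = {"@graph": [
    {"@id": "./", "hasPart": [{"@id": "file.txt"}, {"@id": "subcrate/"}]},
    {"@id": "file.txt"},
    {"@id": "subcrate/", "hasPart": [{"@id": "subfile.txt"}, {"@id": "subsubcrate/"}]},
]}
print(dangling_references(metadata))
# [('subcrate/', 'subfile.txt'), ('subcrate/', 'subsubcrate/')]
```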

simleo · 2025-12-12T12:04:31Z

d2649a2 changes crate membership testing when checking for unlisted files to write: there is now a __contains__ method that enables tests of the form "some_id in crate", returning True if some_id appears in the metadata. The previous code used dereference, but after dereference was changed in this PR it is no longer usable for this purpose, because it "always works" for subcrate members (since it triggers subcrate loading). Additionally, dereference is now more computationally expensive. With the functionality of write and write_zip restored, writing crates now works out of the box, without the need to override anything in Subcrate.
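
The difference between the two checks can be sketched as follows. This is a simplified, hypothetical model rather than the actual ro-crate-py code: MiniCrate, its constructor arguments and the subcrate_loader callback are all assumptions made for illustration.

```python
class MiniCrate:
    """Hypothetical, simplified crate: not the ro-crate-py implementation."""

    def __init__(self, entities, subcrate_loader=None):
        self._entities = {e["@id"]: e for e in entities}
        self._subcrate_loader = subcrate_loader

    def __contains__(self, entity_id):
        # Pure metadata lookup: never triggers subcrate loading.
        return entity_id in self._entities

    def dereference(self, entity_id):
        # May fall back to loading a subcrate, so it can succeed for ids
        # that are not listed in this crate's own metadata.
        if entity_id in self._entities:
            return self._entities[entity_id]
        if self._subcrate_loader is not None:
            return self._subcrate_loader(entity_id)
        return None
```

In this model, "subcrate/subfile.txt" in crate is False when the id is absent from the main crate's metadata, while dereference("subcrate/subfile.txt") can still return an entity via the loader. That is why a dereference-based check cannot tell listed files from unlisted ones.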

LauLauThom marked this pull request as draft December 3, 2025 08:46

LauLauThom added 16 commits December 4, 2025 17:13

first attempt

91b3d6e

move the Subcrate to rocrate main class

8203025

use get_norm_value instead of get + as_list

25b56d3

add support for get with rocrate

c21e0e0

add simple tests

0dbdb15

handle url for subcrates

ce2a024

add flag parse_subcrate

fbe0eee

fix missing flag parse_rocrate

bd02ee2

add subcrate_entities property

bf6f276

support get of subcrate entity from top crate

fd1fd53

also fix flake8 with precommit

load_subcrate as hidden function

9916641

add get_entities to subcrate

22b34f2

add Subcrate to test_data_entities

bf7e0ea

fix issue with nested crates

9e9a4cf

keep conformsTo in Subcrate

21556c8

use getter for inner crate

d98b22f

also prevents directly accessing items listed in subcrate under hasPart e.g subcrate.get("subfile.txt")

LauLauThom force-pushed the subcrate branch from 0d91be2 to d98b22f Compare December 4, 2025 16:16

LauLauThom added 2 commits December 4, 2025 17:41

remove get_entities from Subcrate

629bcfe

remove subcrate.get_entities from tests

771247d

LauLauThom added 2 commits December 5, 2025 11:21

implement crate writing

1368208

add test writing the subcrate

3c77ad2

LauLauThom marked this pull request as ready for review December 10, 2025 09:48

simleo added 2 commits December 11, 2025 12:27

test_write_subcrate: activate parse_subcrate

dfe45eb

don't modify the main crate's jsonld when loading a subcrate

6ea62fe

simleo added 3 commits December 12, 2025 08:35

no trailing slash in generic ro-crate profile, as per the spec

284c9ed

reindent metadata files to reduce diffs

ad51c86

don't use dereference to check for unlisted files

d2649a2
