Skip to content

Conversation

@LauLauThom
Copy link
Contributor

Hi guys,

live from the deNBI hackathon here, I have been playing with reading an entity referencing a subcrate i.e traversing the graph from the top crate to a subcrate, as suggested in the 1.2 spec.

I am proposing a simple approach here, with a new Subcrate class extending the Dataset class.
I defined this class in the main rocrate.py file, in models.py it would cause circular dependencies.

This would allow things like

main_crate = ROCrate(test_data_dir / "crate_with_subcrate")
subcrate = main_crate.get("subcrate")
subfile = subcrate.get("subfile.txt")
# or 
subfile = subcrate["hasPart"][0]

(see added tests too)

at this point I am mostly interested to know if you think that could be a viable approach before going further.

The implementation is such that the subcrate is only loaded when accessing some of its attribute, to avoid potentially loading large amount of metadata, as one purpose of the subcrate is also to reduce the amount of information in the main crate.

@LauLauThom LauLauThom marked this pull request as draft December 3, 2025 08:46
@elichad
Copy link
Contributor

elichad commented Dec 3, 2025

I had a quick look and I quite like this implementation at first glance.

Some extra suggestions:

  • log a message when a subcrate is parsed, to make clear where metadata is being retrieved from
  • could be useful to have an ROCrate.subcrates property, analogous to ROCrate.data_entities - a list of subcrate entities that is built while parsing the main crate. That list can then be used for iteration even if you don't know the subcrate ids (you could even do this recursively if you were willing to load all the crates into memory)

@LauLauThom
Copy link
Contributor Author

LauLauThom commented Dec 3, 2025

Thanks for the quick feedback !

I have pushed a few more commits to allow things like subfile = main_crate.get("subcrate/subfile.txt") and also implemented the ROCrate.subcrates property as suggested.
I also tested with another nested level i.e another subcrate in the subcrate 😅, and that works too.

For the logging, it seems it's not used in the codebase yet, should I just use the logging package, initializing a default logger instance like logger = logging.getLogger(__name__) ?

Now I am wondering if a Subcrate should be a valid RO-Crate object too.
I think it should not be a problem, only a bit verbose to implement all required methods (see Subcrate.get_entities() for an example)

Happy to discuss this in the drop-in call tomorrow 😉

EDIT : I also commited a .pre-commit-config.yaml file to help enforcing flake8 syntax, could be removed before merging of course

@simleo
Copy link
Collaborator

simleo commented Dec 4, 2025

I went through the code and did some testing, which exposed problems:

With parse_subcrate left to False all seems well:

>>> from rocrate.rocrate import ROCrate
>>> crate = ROCrate("test/test-data/crate_with_subcrate")
>>> d = crate.get("subcrate/")
>>> d
<subcrate/ Dataset>
>>> d.get("conformsTo")
'https://w3id.org/ro/crate/'
>>> crate.write("/tmp/crate")

With parse_subcrate set to True:

>>> from rocrate.rocrate import ROCrate
>>> crate = ROCrate("test/test-data/crate_with_subcrate", parse_subcrate=True)
>>> d = crate.get("subcrate/")
>>> d
<subcrate/ Dataset>
>>> d.get("conformsTo")  # this fails
>>> crate.write("/tmp/crate")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/simleo/git/ro-crate-py/rocrate/rocrate.py", line 567, in write
    writable_entity.write(base_path)
  File "/home/simleo/git/ro-crate-py/rocrate/model/metadata.py", line 97, in write
    super()._write_from_stream(write_path)
  File "/home/simleo/git/ro-crate-py/rocrate/model/file.py", line 63, in _write_from_stream
    for _, chunk in self.stream():
  File "/home/simleo/git/ro-crate-py/rocrate/model/metadata.py", line 90, in stream
    yield self.id, str.encode(json.dumps(content, indent=4, sort_keys=True), encoding='utf-8')
                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/json/__init__.py", line 238, in dumps
    **kw).encode(obj)
          ^^^^^^^^^^^
  File "/usr/lib/python3.12/json/encoder.py", line 202, in encode
    chunks = list(chunks)
             ^^^^^^^^^^^^
  File "/usr/lib/python3.12/json/encoder.py", line 432, in _iterencode
    yield from _iterencode_dict(o, _current_indent_level)
  File "/usr/lib/python3.12/json/encoder.py", line 406, in _iterencode_dict
    yield from chunks
  File "/usr/lib/python3.12/json/encoder.py", line 326, in _iterencode_list
    yield from chunks
  File "/usr/lib/python3.12/json/encoder.py", line 406, in _iterencode_dict
    yield from chunks
  File "/usr/lib/python3.12/json/encoder.py", line 326, in _iterencode_list
    yield from chunks
  File "/usr/lib/python3.12/json/encoder.py", line 439, in _iterencode
    o = _default(o)
        ^^^^^^^^^^^
  File "/usr/lib/python3.12/json/encoder.py", line 180, in default
    raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type File is not JSON serializable

I think this has something to do with the hacking of ._jsonld in the Subcrate class. More generally, I'm concerned with the mixing of interfaces (some Dataset and some ROCrate behavior) that happens in the Subcrate class. The role of Subcrate is not clear cut, it sounds very weird that a Subcrate object has a subcrate attribute:

>>> from rocrate.rocrate import ROCrate
>>> crate = ROCrate("test/test-data/crate_with_subcrate", parse_subcrate=True)
>>> subcrate = crate.get("subcrate/")
>>> subcrate.subcrate
>>> subcrate.get("conformsTo")
>>> subcrate.subcrate
<rocrate.rocrate.ROCrate object at 0x7b6edf975160>
>>> subcrate.subcrate.subcrate_entities
[<subsubcrate/ Dataset>]
>>> subcrate.get("conformsTo")
>>> 

The new structure looks very confusing. I did not have the time to work on it yet, but I have the feeling that subcrate support could / should be done without using a special Subcrate class: a subcrate should be of type ROCrate, like the main one, while the subcrate attribute might be added directly to the Dataset class.

The main thing I would like to point out about the current solution is that it hacks critical sections of the code, so getting it right is harder than it looks.

@LauLauThom
Copy link
Contributor Author

LauLauThom commented Dec 4, 2025

Thanks for the feedback Simone, I added a couple changes.

First I made sure any attribute on the original dataset entity (such as conformsTo) is conserved.
I also renamed the Subcrate.subcrate attribute to Subcrate._crate to avoid confusion, and added a get_crate getter for this attribute.

I also changed a bit the behaviour, such that the Subcrate entity behaves more like a Dataset, removing for instance the Subcrate.get_entities.

@LauLauThom LauLauThom marked this pull request as ready for review December 10, 2025 09:48
@LauLauThom
Copy link
Contributor Author

I added a couple more tests to cover the writing of the crate.
Happy to discuss it next week at the EU drop-in call ;)

@simleo
Copy link
Collaborator

simleo commented Dec 11, 2025

6ea62fe avoids modification of the main crate Dataset's metadata when loading a subcrate. Before this change, when reading and then writing an RO-Crate with loaded subcrates, the new top-level metadata file was:

{
    "@context": "https://w3id.org/ro/crate/1.1/context",
    "@graph": [
        {
            "@id": "./",
            "@type": "Dataset",
            "datePublished": "2025-12-02T08:39:54+00:00",
            "description": "A RO-Crate containing a subcrate",
            "hasPart": [
                {
                    "@id": "file.txt"
                },
                {
                    "@id": "subcrate/"
                }
            ],
            "license": "https://spdx.org/licenses/MIT.html",
            "name": "Top-level crate with subcrate"
        },
        {
            "@id": "ro-crate-metadata.json",
            "@type": "CreativeWork",
            "about": {
                "@id": "./"
            },
            "conformsTo": {
                "@id": "https://w3id.org/ro/crate/1.1"
            }
        },
        {
            "@id": "file.txt",
            "@type": "File"
        },
        {
            "@id": "subcrate/",
            "@type": "Dataset",
            "conformsTo": {
                "@id": "https://w3id.org/ro/crate/"
            },
            "hasPart": [
                {
                    "@id": "subfile.txt"
                },
                {
                    "@id": "subsubcrate/"
                }
            ]
        }
    ]
}

which has several problems:

  • "subfile.txt" and "subsubcrate/" are not valid ids in the main crate: "subcrate/subfile.txt" and "subcrate/subsubcrate/" would be the correct ones, respectively
  • There are no entities corresponding to those ids
  • The output metadata of a copied crate is different from the input one

@simleo
Copy link
Collaborator

simleo commented Dec 12, 2025

d2649a2 changes crate membership testing when checking for unlisted files to write: now there's a __contains__ method that enables some_id in crate tests, returning True if some_id appears in the metadata. The previous code used dereference, but after it was changed in this PR it's not usable anymore because it "always works" for subcrate members (since it triggers subcrate loading). Additionally, dereference is now more computationally expensive. With the functionality of write and write_zip restored, writing crates now works out of the box, without the need to override anything in Subcrate.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants