Skip to content

Conversation

@JesusMcCloud
Copy link
Contributor

@JesusMcCloud JesusMcCloud commented Jul 5, 2025

fixes #2975

This PR introduces structured CBOR encoding and decoding

Encoding from/to CborElement

Bytes can be decoded into an instance of CborElement with the [Cbor.decodeFromByteArray] function by either manually
specifying [CborElement.serializer()] or specifying [CborElement] as generic type parameter.
It is also possible to encode arbitrary serializable structures to a CborElement through [Cbor.encodeToCborElement].

Since these operations use the same code paths as regular serialization (but with specialized serializers), the config flags
behave as expected

Newly introduced CBOR-specific structures

  • [CborPrimitive] represents primitive CBOR elements, such as string, integer, float boolean, and null.
    CBOR byte strings are also treated as primitives
    Each primitive has a [value][CborPrimitive.value]. Depending on the concrete type of the primitive, it maps
    to corresponding Kotlin Types such as String, Int, Double, etc.
    Note that Cbor discriminates between positive ("unsigned") and negative ("signed") integers!
    CborPrimitive is itself an umbrella type (a sealed class) for the following concrete primitives:

    • [CborNull] mapping to a Kotlin null
    • [CborBoolean] mapping to a Kotlin Boolean
    • [CborInt] which is an umbrella type (a sealed class) itself for the following concrete types
      (it is still possible to instantiate it as the invoke operator on its companion is overridden accordingly):
      • [CborPositiveInt] represents all Long numbers ≥0
      • [CborNegativeInt] represents all Long numbers <0
    • [CborString] maps to a Kotlin String
    • [CborFloat] maps to Kotlin Double
    • [CborByteString] maps to a Kotlin ByteArray and is used to encode them as CBOR byte string (in contrast to a list
      of individual bytes)
  • [CborList] represents a CBOR array. It is a Kotlin [List] of CborElement items.

  • [CborMap] represents a CBOR map/object. It is a Kotlin [Map] from CborElement keys to CborElement values.
    This is typically the result of serializing an arbitrary

Example

bf                                 # map(*)
   61                              #   text(1)
      61                           #     "a"
   cc                              #   tag(12)
      1a 0fffffff                  #     unsigned(268,435,455)
   d8 22                           #   base64 encoded text, tag(34)
      61                           #     text(1)
         62                        #       "b"
                                   #     invalid length at 0 for base64
   20                              #   negative(-1)
   d8 38                           #   tag(56)
      61                           #     text(1)
         63                        #       "c"
   d8 4e                           #   typed array of i32, little endian, twos-complement, tag(78)
      42                           #     bytes(2)
         cafe                      #       "\xca\xfe"
                                   #     invalid data length for typed array
   61                              #   text(1)
      64                           #     "d"
   d8 5a                           #   tag(90)
      cc                           #     tag(12)
         6b                        #       text(11)
            48656c6c6f20576f726c64 #         "Hello World"
   ff                              #   break

Decoding it results in the following CborElement (shown in manually formatted diagnostic notation):

CborMap(tags=[], content={  
    CborString(tags=[],   value=a) = CborPositiveInt( tags=[12],     value=268435455),  
    CborString(tags=[34], value=b) = CborNegativeInt( tags=[],       value=-1),  
    CborString(tags=[56], value=c) = CborByteString(  tags=[78],     value=h'cafe),  
    CborString(tags=[],   value=d) = CborString(      tags=[90, 12], value=Hello World)  
})

Implementation Details

I tried to stick to the existing CBOR codepaths as closely as possible, and the approach to add tags directly to CborElements is the most pragmatic way of getting expressiveness and convenient use. It does come with a caveat (also taken from the Readme:

Tags are properties of CborElements, and it is possible to mixing arbitrary serializable values with CborElements that
contain tags inside a serializable structure. It is also possible to annotate any [CborElement] property
of a generic serializable class with @ValueTags.
This can lead to asymmetric behavior when serializing and deserializing such structures!

The test cases (and comments in the test cases reflect this

Closing Remarks

I also fixed a faulty hex input test vector that I introduced myself, last year, if I pieced it together correctly (see here) and I amended the benchmarks. (see here).

Since the commits from here will be squashed anyways, I did not care for a clean history.

@JesusMcCloud
Copy link
Contributor Author

Full disclosure: This PR incorporates code from a draft generated by Junie (albeit an impressive draft that saved a day of work). This is not a dumb copypasta of AI-generated code. Even if it were already feature-complete It would still not yet be marked ready for review because we have yet to review everything internally. I also want to stress that "we" is not a euphemism. There will be at least two of us reviewing and discussing internally, almost certainly with additional input from other humans in the process of readying this PR.

Copy link
Contributor

@pdvrieze pdvrieze left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've reviewed the code at a general level. There is a lot of repetition, so I've only commented on the first case, not every one. I guess some of the AI generation is visible in the details.

@JesusMcCloud JesusMcCloud changed the base branch from master to dev August 5, 2025 11:26
@JesusMcCloud JesusMcCloud force-pushed the feature/structuredCbor branch 2 times, most recently from 674e7ef to aea3f4b Compare August 7, 2025 10:12
@JesusMcCloud JesusMcCloud force-pushed the feature/structuredCbor branch 2 times, most recently from cd963c7 to c89fd46 Compare August 7, 2025 11:27
@JesusMcCloud
Copy link
Contributor Author

Performance seems to be OK (fromBytes and toBytes are the baseline on my machine):

Metric / Benchmark fromBytes fromStruct structFromBytes toBytes structToBytes toStruct
Average (ops/ms) 1205.615 ± 20.541 1545.814 ± 50.743 2896.728 ± 74.485 2089.013 ± 30.152 1442.766 ± 32.257 2581.397 ± 32.497
Min 1186.023 1458.225 2796.131 2066.499 1404.482 2550.026
Max 1229.778 1581.420 2960.572 2125.658 1475.015 2619.815
Stdev 13.586 33.563 49.267 19.944 21.336 21.495
CI low (99.9 %) 1185.075 1495.071 2822.244 2058.861 1410.509 2548.900
CI high (99.9 %) 1226.156 1596.557 2971.213 2119.165 1475.023 2613.893

My hot takes:

  • Deserialising from a structure is fast enough since it is in the same ballpark as deserialising from bytes
  • Deserialising into a generic CBOR structure takes twice the time than directly deserialising, which is fine, given that we instantiate much more as even primitives need a containing class and an array of tags
  • Serialising a generic CBOR structure to bytes is faster but in the same ballpark as generic to-byte serialisation of arbitrary serializable data
  • Serializing to a CBOR structure is slower than to bytes, but OK enough, since it's in the same ballpark and we instantiate more

@JesusMcCloud
Copy link
Contributor Author

JesusMcCloud commented Aug 7, 2025

I just noticed something that looks weird to me. See this test case here that is failing and closely compare expected vs actual.

the byte string is wrapped twice for the reference. I know there were some discussions, but I don't recall them, so I have to ask: why? did I mess this up last year or is this intentional? Because the way I see it, were' wrapping a bytearray instead of encoding it differently
EDIT: the test vector is faulty as this comparison fails the same way

@JesusMcCloud JesusMcCloud force-pushed the feature/structuredCbor branch from 7cdbe4d to 9b3f0e5 Compare August 8, 2025 13:22
@JesusMcCloud
Copy link
Contributor Author

Any updates on the open discussion points?

@fzhinkin fzhinkin self-assigned this Oct 7, 2025
@fzhinkin fzhinkin self-requested a review October 7, 2025 17:18
@fzhinkin fzhinkin added the cbor label Oct 14, 2025
Copy link
Contributor

@fzhinkin fzhinkin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@JesusMcCloud, sorry for the delay with review. 😿

I didn't thoroughly review the serialization machinery, but focused on the API part for now.


/*need to expose writer to access encodeTag()*/
internal fun Encoder.asCborEncoder() = this as? CborEncoder
?: throw IllegalStateException(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't forget to cover this behavior with tests.

*/
@Serializable(with = CborIntSerializer::class)
@ExperimentalSerializationApi
public class CborInt(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think about CborInteger instead? It's closer to major type descriptions from the spec (an unsigned integer, a negative integer) and it is farther from Kotlin's Int.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with both. Completely up to you guys

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's make it CborInteger then.

}

/**
* **WARNING! Possible truncation/overflow!** E.g., `-2^64` -> `1`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Silent truncation/overflow is something we should avoid. I'd vote for an exception if a value could not be represented as Long (Int, Short, or Byte - more on that later), and an additional toLongOrNull function, that'll return null on overflow.

It would be nice to align the API with what we already have for JsonElement and instead of providing a member function, provide extension properties.
And it seems reasonable to provide properties returning other integer types too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

toLongOrNull sound good (also for other integer types)!

What would be core functionality (implemented as member functions) and what should be extensions? Or is it really just "look at JsonElement and replicate that" without leaving anything ambiguous?

@Serializable(with = CborNullSerializer::class)
@ExperimentalSerializationApi
public class CborNull(vararg tags: ULong) : CborPrimitive<Unit>(Unit, tags)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Topic for later discussion: do we need a CborUndefined? It's one of the values that could not be parsed with the existing API, yet it is something a valid CBOR document might contain.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great catch! We need a CborUndefined. I totally forgot about that!

* traditional methods like [List.get] or [List.size] to obtain CBOR elements.
*/
@Serializable(with = CborListSerializer::class)
public class CborList(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice to rename @CborArray into something else, and reuse the name here. And it would be also nice to rename @ByteString annotation too.

While CBOR format is experimental, we are free to break things, but it would be nice to make a transition a bit smoother, and, probably, deprecate @CborAray right now and provide an alternative annotation, and as a next step provide the CborElement API with CborArray reused for the aggregate type.

@sandwwraith WDYT?

")"
}

} No newline at end of file
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are few things that are currently missing, but that would otherwise improve DX when it comes to using CborElement API:

  • casting functions/properties such as
    • val CborElement.cborFloat: CborFloat, val CborElement.cborList: CborList
    • fun CborElement.asCborFloat(): CborFloat, ...
  • builders for aggregates, similar to JsonObjectBuilder and JsonArrayBuilder

Those could be added later, though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that these should be there and all of those are easy wins, given a foundation is there and I can add all of these. I'd just like a complete list (even if it is "only" a link to some JSON API Docs or whatever). BUT: I'd really like this PR to make it into the closest possible release, as we have a very concrete need for this.

That being said, I'd propose the following strategy going forward:

  • I'll address all the open issues that are sorted out already (i. e. stuff that just needs to be done).
  • All the open discussion points that affect the API should be sorted out ASAP
  • We will integrate a snapshot build into our codebase, so we have some real-world feedback to check if the API is missing anything or if some behaviour is too limiting
  • Then we'll finalise this PR
  • Depending on how large everything is and how pressing the need for builders is, they should be part of this PR or part of a separate one. I just don't want something small being the reason for having to wait for another release cycle…

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Depending on how large everything is and how pressing the need for builders is, they should be part of this PR or part of a separate one

We can definitely add them as a follow up:

Those could be added later, though.


override fun deserialize(decoder: Decoder): CborFloat {
decoder.asCborDecoder()
return CborFloat(decoder.decodeDouble())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tags will be lost here during deserialization (the same is true for CborMap and CborList deserialization routines).

override fun isElementOptional(index: Int): Boolean = original.isElementOptional(index)
}

private fun CborWriter.encodeTags(value: CborElement) { // Encode tags if present
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is unused

val cborEncoder = encoder.asCborEncoder()
cborEncoder.encodeTags(value.tags)
//this we really don't want to expose so we cast here
(cborEncoder as CborWriter).encodeByteString(value.value)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But the CborEncoder is public and it could be anything else.

}

decoder.decodeNull()
return CborNull()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tags will be lost here too

fun finalize() = when (this) {
is List -> CborList(content = elements, tags = tags)
is Map -> CborMap(
content = if (elements.isNotEmpty()) IntRange(0, elements.size / 2 - 1).associate {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It worth adding a check that elements.size is even.

is Map -> CborMap(
content = if (elements.isNotEmpty()) IntRange(0, elements.size / 2 - 1).associate {
elements[it * 2] to elements[it * 2 + 1]
} else mapOf(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would also vote for

buildMap {
    for (i in 0 until elements.size / 2 - 1) {
        put(elements[i * 2], elements[i * 2 + 1])
     }
}

as an alternative to if (..) associate {} or mapOf()

protected val elements = elements

var tags = tags
internal set
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CborContainer is internal anyway

class List(tags: ULongArray, elements: MutableList<CborElement> = mutableListOf()) :
CborContainer(tags, elements)

class Primitive(tags: ULongArray) : CborContainer(tags, elements = mutableListOf()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Primitive is unused and could be removed, isn't it?

public fun <T> encodeToCborElement(serializer: SerializationStrategy<T>, value: T): CborElement {
val writer = StructuredCborWriter(this)
writer.encodeSerializableValue(serializer, value)
return writer.finalize()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does not support primitive values (try Cbor.encodeToCborElement(42)).
It also fails for ByteArrays when alwaysUseByteString = true.


// value tags are collects inside beginStructure, so we need to cache them here and write them in beginStructure or encodeXXX
// and then null them out, so there are no leftovers
private var nextValueTags: ULongArray = ulongArrayOf()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's better to be a function with a name that'll give away the behavior.

Also, I'm not sure that reallocating array every time we need to append a tag is the best way to collect tags. ProtobufTaggedBase shows a more efficient approach, although for CBOR tagsStack-analogue should be nullable and has null value by default (I assume that elements are usually untagged).

isClass = descriptor.getElementDescriptor(index).kind == StructureKind.CLASS

encodeByteArrayAsByteString = descriptor.isByteString(index)
//TODO check if cborelement and be done
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I didn't get what's left to do here

* @throws [IllegalArgumentException] if the decoded input cannot be represented as a valid instance of type [T]
*/
@ExperimentalSerializationApi
public inline fun <reified T> Cbor.decodeFromCborElement(element: CborElement): T =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The following round-trip transformation does not work with elements, but works find when encoding to / decoding from a byte array:

        @Serializable data class Wrapped(val x: Int)
        @Serializable data class Wrapper(val datum: Wrapped?)
        val wrapper = Wrapper(null)
        Cbor.decodeFromCborElement<Wrapper>(Cbor.encodeToCborElement(wrapper))

* Deserializes the given [element] element into a value of type [T] using a deserializer retrieved
* from reified type parameter.
*
* @throws [SerializationException] if the given JSON element is not a valid CBOR input for the type [T]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there are no such JsonElements that are valid CBOR inputs ;)

override fun isNull() =
if (layer.isStructure) layer.peek().let {
it is CborNull ||
/*THIS IS NOT CBOR-COMPLIANT but KxS-proprietary handling of nullable classes*/
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that a null-as-an-empty-map and an empty map (or an object w/o properties) have different encoding and are distinguishable from each other.

However, once we have an empty CborMap, it's impossible to say if it represents a null, or an empty map.

As a result, it leads to some ambiguity:

        @Serializable class Wrapped
        @Serializable data class Wrapper(val x: Wrapped?)
        @Serializable data class StrictWrapper(val x: Wrapped)
        val b = Wrapper(Wrapped())
        
        val element = Cbor.decodeFromByteArray<CborElement>(Cbor.encodeToByteArray(b))
        // Wrapper(x=null), but x is not null here
        println(Cbor.decodeFromCborElement<Wrapper>(element))
        // StrictWrapper(x=kotlinx.serialization.cbor.CborElementTest$cborEncodeDecodeNull$Wrapped@57ae95ca)
        println(Cbor.decodeFromCborElement<StrictWrapper>(element))

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The underlying issue here is the non-compliant default behaviour that requires this special handling of nullable classes, because with correct defaults, we would not need to take this branch for maps, wouldn't we?!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please clarify what do you mean by "non-compliant default behavior"? Null objects being represented by empty maps?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm referring to #2848. Probably mo own doing, but I'm still investigating

* Appends the given CBOR [element] to the current output.
* This method is allowed to invoke only as the part of the whole serialization process of the class,
* calling this method after invoking [beginStructure] or any `encode*` method will lead to unspecified behaviour
* and may produce an invalid CBOR result.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TBH, this is a very strange limitation (which I also see present in CborDecoder).
Could you please describe it in more detail, and why we can't do this?

Comment on lines +72 to +80
/**
* Encode a negative value as [CborInt]. This function exists to encode negative values exceeding [Long.MIN_VALUE]
*/
public fun encodeNegative(value: ULong)

/**
* Encode a positive value as [CborInt]. This function exists to encode negative values exceeding [Long.MAX_VALUE]
*/
public fun encodePositive(value: ULong)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It feels a bit strange that we expose those.
If users need those, they should be able to just do:

encodeCborElement(CborInteger(...))

@ExperimentalSerializationApi
public class CborInt(
absoluteValue: ULong,
public val sign: Sign,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes! I'm fine with a flag like isPositive/isNegative.
Should we also then add an assertion that if the value is 0, we should have isPositive=true?

public fun encodeCborElement(element: CborElement): Unit = encodeSerializableValue(CborElementSerializer, element)

/**
* Allows manually encoding CBOR tags. Use with caution, as it is possible to produce invalid CBOR if invoked carelessly!
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I remember that it was my suggestion to add this function, but I was thinking that it would be just an alternative to *Tags annotations.
As per the specification, tags are encoded as is, followed by the data items. So something like:

encodeTags(...)
encodeTags(...) // this should also work fine
encodeInt(...) // or anything else

So the only erroneous call to encodeTags is the one which will not be followed by another encode* call, am I right?

In this case, maybe we need to change the signature to something like encodeWithTags(...) { this: Encoder -> } so that we can easily check this? Or, we can do it similarly to inline classes encoding?
Or perhaps it means we need to rethink this API (and maybe all other tag-related API)?

In any case, I think it would be beneficial to explore how real-world use cases (with tags) might be implemented with this API without using the @Serializable annotation.

Here is one of the RFCs, which is not COSE.
Multi-tag example:

d8 c8               # tag(200) envelope
   d8 c9            # tag(201) leaf
      65            # text(5)
         416c696365 # "Alice"

Array example:

d8 c8                     # tag(200) envelope
   82                     # array(2)
      d8 c9               # tag(201) leaf
         65               # text(5)
            416c696365    # "Alice"
      a1                  # map(1)
         d8 c9            # tag(201) leaf
            65            # text(5)
               6b6e6f7773 # "knows"
         d8 c9            # tag(201) leaf
            63            # text(3)
               426f62     # "Bob"

@JesusMcCloud
Copy link
Contributor Author

Thanks for alle the comments! I'll have to dig up some memories that have since collected dust to sort some of the issue out and figure some stuff out again from scratch, as I haven't looked into this for many weeks and forgotten about most of the implementation details ;-). So it will take a bit before I'll push changes, addressing issues.

*/
internal object CborMapSerializer : KSerializer<CborMap>, CborSerializer {
private object CborMapDescriptor :
SerialDescriptor by MapSerializer(CborElementSerializer, CborElementSerializer).descriptor {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We now have SerialDescriptor(serialName, original) function

*/
internal object CborByteStringSerializer : KSerializer<CborByteString>, CborSerializer {
override val descriptor: SerialDescriptor =
PrimitiveSerialDescriptor("kotlinx.serialization.cbor.CborByteString", PrimitiveKind.STRING)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CborByteString is a special case of a ByteArray, rather than a regular string. And that should be expressed in the seriailzer's serial descriptor.

Comment on lines +65 to +66
public sealed class CborPrimitive<T : Any>(
public val value: T,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It makes sense to get rid of a type parameter here, and make value non public:

  • some of the inheritors don't really fit into this design:
    • CborNull should not have a value at all (most likely), and Unit value looks off there.
    • CborInt[eger]s value is a pair of ULong and a sign flag, not just a ULong itself; using its value as of now may be an error.
  • right now, CborPrimitive is almost exclusively used as CborPrimitive<*> throughout a codebase
  • if main purpose of the value property is reducing the number of identical hashCode/equals/toString overrides, then we can turn it into a protected Any-valued one.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Generic CBOR Parsing for EUDIW

5 participants