[5.4] - RavenDB-24810 & RavenDB-24815 #77

grisha-kotler · 2025-08-04T19:51:47Z

Consists of two changes:

1. High usage of LOH memory - Replace arrays larger than 64KB that are used in the term cache and for sorting with unmanaged arrays. This reduces the LOH size significantly. The cache is long-lived, but it can still be disposed of when no longer needed, which leaves gaps in the LOH and causes fragmentation. Moving these arrays to unmanaged memory removes this source of LOH fragmentation.
   https://issues.hibernatingrhinos.com/issue/RavenDB-24810/High-usage-of-LOH-memory-in-Lucene
2. Reduce object reference overhead - Store field numbers (int) in arrays instead of field name strings. This reduces the number of object references that the GC needs to traverse, thereby significantly reducing Gen 2 GC time.
   https://issues.hibernatingrhinos.com/issue/RavenDB-24815/Large-reference-graph-for-arrays-of-strings
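
To illustrate the second change, here is a minimal sketch of storing field numbers instead of field name strings. `FieldNumberTableSketch` and its members are hypothetical names invented for this example, not types from the PR. The point it shows is that an `int[]` holds no object references, so the GC has nothing to follow inside it, whereas a `string[]` holds one reference per element.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: maps each distinct field name to a small integer once,
// so per-term arrays can hold ints instead of string references.
public sealed class FieldNumberTableSketch
{
    private readonly Dictionary<string, int> _numberByName =
        new Dictionary<string, int>(StringComparer.Ordinal);
    private readonly List<string> _nameByNumber = new List<string>();

    // Returns the existing number for a field name, or assigns the next free one.
    public int GetOrAdd(string fieldName)
    {
        if (_numberByName.TryGetValue(fieldName, out int number))
            return number;

        number = _nameByNumber.Count;
        _nameByNumber.Add(fieldName);
        _numberByName.Add(fieldName, number);
        return number;
    }

    // Resolves a number back to its field name when the string is actually needed.
    public string NameOf(int fieldNumber) => _nameByNumber[fieldNumber];

    // Converts a per-term list of field names into a reference-free int array.
    public int[] Encode(IReadOnlyList<string> perTermFieldNames)
    {
        var numbers = new int[perTermFieldNames.Count];
        for (int i = 0; i < numbers.Length; i++)
            numbers[i] = GetOrAdd(perTermFieldNames[i]);
        return numbers;
    }
}
```

With this table, encoding the names `title`, `body`, `title` yields the numbers 0, 1, 0, and the name is only materialized again through `NameOf` when a caller needs it.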

src/Lucene.Net/Index/TermBuffer.cs

src/Lucene.Net/Search/FieldCacheRangeFilter.cs

src/Lucene.Net/Search/FieldComparator.cs

src/Lucene.Net/Util/IArray.cs

ayende · 2025-08-05T08:27:41Z

src/Lucene.Net/Util/HybridArray.cs

+        if (length < 0)
+            throw new ArgumentOutOfRangeException(nameof(length));
+
+        if (length * sizeof(T) > LOH_THRESHOLD)


Why not make everything unmanaged?

I think using the ArrayPool will be more cost-effective for small allocations. For long arrays, using this threshold will accommodate 8,192 elements, while for TermInfo (which is 32 bytes), it will contain up to 2,048 elements. Smaller segments with fewer cached terms are more likely to be merged frequently than larger segments, which creates more allocation and deallocation cycles.
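
The element counts quoted above follow directly from dividing the threshold by the element size. A minimal sketch of that arithmetic, assuming the threshold is 64KB as stated in the PR description; `LohThresholdSketch` and `TermInfoSizedSketch` are hypothetical names, and the 32-byte struct only stands in for `TermInfo`:

```csharp
using System;
using System.Runtime.CompilerServices;

// Hypothetical 32-byte struct standing in for TermInfo (four 8-byte fields).
public struct TermInfoSizedSketch
{
    public long A, B, C, D;
}

public static class LohThresholdSketch
{
    // Assumption: 64KB, taken from the PR description.
    public const int LohThreshold = 64 * 1024;

    // Largest element count that still stays on the managed path.
    public static int MaxManagedLength<T>() where T : unmanaged
        => LohThreshold / Unsafe.SizeOf<T>();

    // Mirrors the check in the diff, but multiplies in 64-bit
    // so that a very large length cannot overflow int.
    public static bool ExceedsThreshold<T>(int length) where T : unmanaged
    {
        if (length < 0)
            throw new ArgumentOutOfRangeException(nameof(length));

        return (long)length * Unsafe.SizeOf<T>() > LohThreshold;
    }
}
```

`MaxManagedLength<long>()` gives 8,192 and `MaxManagedLength<TermInfoSizedSketch>()` gives 2,048, matching the numbers in the comment.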

The problem here is that you are looking at memory only, but the virtual method calls etc. are also costly.
Why not do everything in unmanaged memory?

I removed the virtual method calls.

And you added two branches for each index access? An if statement on each operation?
Why is that something that we want to do?

This is something that we can check, but in general I think that calling AllocHGlobal and FreeHGlobal for each of those arrays every time will carry a much bigger penalty. Smaller allocations should be managed and come from a pool, IMO, in this case.

Copying data between managed and unmanaged memory isn't free. We also have more context here, as @grisha-kotler pointed out in his first answer. So to me it seems reasonable to use unmanaged memory only for bigger allocations (since we're trying to solve the LOH compaction issue).

Regarding the additional if that creates two branches: we have sometimes tried to make code branchless, but that did not always result in better performance.

I think the decision should be based on perf testing results instead of relying on intuition. This code path is too important. Note that it will require quite comprehensive testing across a variety of scenarios.
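
For reference while weighing the costs discussed in this thread, here is a minimal sketch of what an unmanaged array wrapper involves. It is not the PR's `UnmanagedArray<T>`: `UnmanagedLongArraySketch` is a hypothetical, `long`-only class that uses `Marshal.ReadInt64` and `Marshal.WriteInt64` to avoid `unsafe` code, which is slower than raw pointer access. It borrows three ideas from the commit list: a sealed class, a finalizer as a safety net, and a non-inlined throw helper for the bounds check.

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
using System.Threading;

public sealed class UnmanagedLongArraySketch : IDisposable
{
    private IntPtr _ptr;

    public int Length { get; }

    public UnmanagedLongArraySketch(int length)
    {
        if (length < 0)
            throw new ArgumentOutOfRangeException(nameof(length));

        Length = length;
        // AllocHGlobal does not zero the memory; callers must write before reading.
        _ptr = Marshal.AllocHGlobal(new IntPtr((long)length * sizeof(long)));
    }

    public long this[int index]
    {
        get
        {
            // One unsigned comparison covers both index < 0 and index >= Length.
            if ((uint)index >= (uint)Length) ThrowOutOfRange();
            // The byte offset is an int, so this sketch only suits arrays under 2GB.
            return Marshal.ReadInt64(_ptr, index * sizeof(long));
        }
        set
        {
            if ((uint)index >= (uint)Length) ThrowOutOfRange();
            Marshal.WriteInt64(_ptr, index * sizeof(long), value);
        }
    }

    // Kept out of line so the indexer body stays small.
    [MethodImpl(MethodImplOptions.NoInlining)]
    private static void ThrowOutOfRange() => throw new IndexOutOfRangeException();

    public void Dispose()
    {
        Free();
        GC.SuppressFinalize(this);
    }

    // Safety net if Dispose is never called.
    ~UnmanagedLongArraySketch() => Free();

    private void Free()
    {
        // Exchange makes a double Dispose harmless.
        IntPtr p = Interlocked.Exchange(ref _ptr, IntPtr.Zero);
        if (p != IntPtr.Zero)
            Marshal.FreeHGlobal(p);
    }
}
```

The sketch does not guard against use after `Dispose`, and it says nothing about which option is faster; as noted above, that is a question for perf testing.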

src/Lucene.Net/Index/TermBuffer.cs

src/Lucene.Net/Util/IArray.cs

maciejaszyk · 2025-08-05T09:12:52Z

src/Lucene.Net/Search/Function/ReverseOrdFieldSource.cs

 		private class AnonymousClassDocValues:DocValues
 		{
-			public AnonymousClassDocValues(int end, int[] arr, ReverseOrdFieldSource enclosingInstance)
+			public AnonymousClassDocValues(int end, IArray<int> arr, ReverseOrdFieldSource enclosingInstance)


We could avoid boxing here by using TArray where TArray : IArray<int>

and during construction:

    if (arr.GetType() == typeof(ManagedArray<int>))
        return new AnonymousClassDocValues<ManagedArray<int>>(end, (ManagedArray<int>)arr, this);
    return new AnonymousClassDocValues<UnmanagedArray<int>>(end, (UnmanagedArray<int>)arr, this);
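
A minimal sketch of the shape this suggestion would take. All type names ending in `Sketch` are hypothetical stand-ins, the second array class merely stands in for the unmanaged implementation, and the `end - arr[doc]` body is an assumption about what the doc values compute. One caveat worth checking before adopting it: the JIT specializes generic code per value type but shares one instantiation across reference types, so with sealed classes as type arguments the gain may be smaller than with structs.

```csharp
using System;

// Hypothetical stand-in for IArray<int>.
public interface IIntArraySketch
{
    int Length { get; }
    int this[int index] { get; }
}

public sealed class ManagedIntArraySketch : IIntArraySketch
{
    private readonly int[] _items;
    public ManagedIntArraySketch(int[] items) => _items = items;
    public int Length => _items.Length;
    public int this[int index] => _items[index];
}

// Stands in for the unmanaged implementation; backed by an int[] to keep the sketch safe.
public sealed class OtherIntArraySketch : IIntArraySketch
{
    private readonly int[] _items;
    public OtherIntArraySketch(int[] items) => _items = items;
    public int Length => _items.Length;
    public int this[int index] => _items[index];
}

public abstract class DocValuesSketch
{
    public abstract int IntVal(int doc);
}

// The concrete array type becomes a type parameter instead of an interface-typed field.
public sealed class ReverseOrdDocValuesSketch<TArray> : DocValuesSketch
    where TArray : IIntArraySketch
{
    private readonly int _end;
    private readonly TArray _arr;

    public ReverseOrdDocValuesSketch(int end, TArray arr)
    {
        _end = end;
        _arr = arr;
    }

    public override int IntVal(int doc) => _end - _arr[doc];
}

public static class ReverseOrdDocValuesFactorySketch
{
    // Picks the generic instantiation once, at construction time.
    public static DocValuesSketch Create(int end, IIntArraySketch arr)
    {
        if (arr.GetType() == typeof(ManagedIntArraySketch))
            return new ReverseOrdDocValuesSketch<ManagedIntArraySketch>(end, (ManagedIntArraySketch)arr);

        return new ReverseOrdDocValuesSketch<OtherIntArraySketch>(end, (OtherIntArraySketch)arr);
    }
}
```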

arekpalinski · 2025-08-06T07:45:10Z

src/Lucene.Net/Util/HybridArray.cs

+
+        if (length * sizeof(T) > LOH_THRESHOLD)
+        {
+            _ptr = UnmanagedStringArray.Segment.AllocateMemory(_elementSize * _length, type);


While I agree that moving this to unmanaged memory is good and beneficial, I'd like to make a note here for us to consider. I think this is a big change for us, and it could potentially become a source of various problems.

I'm thinking about potential issues related to:

- unmanaged heap fragmentation

- long-lived unmanaged allocations, since they are going to be kept by a cache

I've introduced a UseOnlyManagedArray flag to give us precise control over this behavior and mitigate the risks.

For 32-bit: You are absolutely right. Unmanaged heap fragmentation can become a serious issue due to the limited virtual address space. To prevent this, we will set UseOnlyManagedArray = true in our 32-bit builds, completely avoiding unmanaged allocations and the associated risks.
For 64-bit: This is far less of a concern. The vast virtual address space in 64-bit processes makes it highly unlikely for fragmentation to cause allocation failures. The performance benefits of avoiding the LOH outweigh the negligible risk of fragmentation here.

Regarding the long-lived unmanaged allocations, this is a deliberate trade-off. While these allocations might be long-lived because they are held by a cache, it is preferable to have them in the unmanaged heap rather than in the LOH.
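
A minimal sketch of how such a flag could gate the allocation decision. Only the flag name `UseOnlyManagedArray` and the 64KB threshold come from this PR; the class name, the method, and the `Environment.Is64BitProcess` default are assumptions made for illustration (the reply above describes setting the flag in 32-bit builds, which may well be done differently).

```csharp
using System;

public static class ArrayAllocationPolicySketch
{
    // Assumption: 64KB, as stated in the PR description.
    public const long LohThreshold = 64 * 1024;

    // Assumed default: stay fully managed in 32-bit processes,
    // where virtual address space is limited.
    public static bool UseOnlyManagedArray = !Environment.Is64BitProcess;

    // True when an allocation of the given size should go to unmanaged memory.
    public static bool ShouldUseUnmanaged(long sizeInBytes)
    {
        if (sizeInBytes < 0)
            throw new ArgumentOutOfRangeException(nameof(sizeInBytes));

        if (UseOnlyManagedArray)
            return false;

        return sizeInBytes > LohThreshold;
    }
}
```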

arekpalinski · 2025-08-06T07:45:15Z

src/Lucene.Net/Util/HybridArray.cs

+        if (length < 0)
+            throw new ArgumentOutOfRangeException(nameof(length));
+
+        if (length * sizeof(T) > LOH_THRESHOLD)


Copying data between managed and unmanaged memory isn't free. We also have more context here, as @grisha-kotler pointed out in his first answer. So to me it seems reasonable to use unmanaged memory only for bigger allocations (since we're trying to solve the LOH compaction issue).

Regarding the additional if that creates two branches: we have sometimes tried to make code branchless, but that did not always result in better performance.

I think the decision should be based on perf testing results instead of relying on intuition. This code path is too important. Note that it will require quite comprehensive testing across a variety of scenarios.

arekpalinski · 2025-08-06T07:49:40Z

src/Lucene.Net/Index/ArrayHolder.cs

+            using (_indexPointers)
+            using (_termInfos)
+            {
+                OnArrayHolderDisposed?.Invoke(_managedAllocations);


Previously we called it after actually releasing the other resources, but now we do it before. Is that okay? (Just a sanity check, I don't know who uses OnArrayHolderDisposed.)

OnArrayHolderDisposed is used to monitor managed memory usage.
I moved the call so that it runs after the disposal.
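
A minimal sketch of the ordering after that change: the arrays are released first, and only then is the callback invoked. `ArrayHolderSketch` and `CallbackDisposableSketch` are hypothetical; the field names mirror the diff, and the `long` type of `_managedAllocations` is an assumption.

```csharp
using System;

// Hypothetical helper that runs a callback on Dispose, used to observe ordering.
public sealed class CallbackDisposableSketch : IDisposable
{
    private readonly Action _onDispose;
    public CallbackDisposableSketch(Action onDispose) => _onDispose = onDispose;
    public void Dispose() => _onDispose();
}

public sealed class ArrayHolderSketch : IDisposable
{
    // Assumed signature: receives the number of managed bytes being released.
    public static Action<long> OnArrayHolderDisposed;

    private readonly IDisposable _indexPointers;
    private readonly IDisposable _termInfos;
    private readonly long _managedAllocations;

    public ArrayHolderSketch(IDisposable indexPointers, IDisposable termInfos, long managedAllocations)
    {
        _indexPointers = indexPointers;
        _termInfos = termInfos;
        _managedAllocations = managedAllocations;
    }

    public void Dispose()
    {
        // Nested usings dispose in reverse order: _termInfos, then _indexPointers.
        using (_indexPointers)
        using (_termInfos)
        {
        }

        // Runs only after both arrays were released. If a Dispose call above
        // throws, this line is skipped; a try/finally would change that.
        OnArrayHolderDisposed?.Invoke(_managedAllocations);
    }
}
```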

arekpalinski · 2025-08-06T07:57:16Z

src/Lucene.Net/Search/FieldCacheImpl.cs

                // We'll return the old StringIndexCache and let the caller decide if he wants to dispose it sooner

-                var copy = _stringIndexCache;
+                var disposable = new DisposableAction(() =>


Can we just pass _stringIndexCache, _longCache and _doubleCache and dispose them explicitly there? This way we won't need to allocate and invoke an Action.

DisposableAction is private, so we can make its implementation more specific to what we deal with here.

What I meant here was to have DisposableAction(StringIndexCache stringIndexCache, LongCache longCache, DoubleCache doubleCache) instead of DisposableAction(Action action), so we don't need to allocate the Action.
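
A minimal sketch of that suggestion. The real cache types are not shown in this thread, so the sketch types its fields as `IDisposable`; `CacheDisposerSketch` is a hypothetical name standing in for the private `DisposableAction`.

```csharp
using System;

// Holds the three caches directly, so no delegate or closure
// has to be allocated just to dispose them later.
public sealed class CacheDisposerSketch : IDisposable
{
    private IDisposable _stringIndexCache;
    private IDisposable _longCache;
    private IDisposable _doubleCache;

    public CacheDisposerSketch(IDisposable stringIndexCache, IDisposable longCache, IDisposable doubleCache)
    {
        _stringIndexCache = stringIndexCache;
        _longCache = longCache;
        _doubleCache = doubleCache;
    }

    public void Dispose()
    {
        // Null the fields so that a second Dispose call does nothing.
        _stringIndexCache?.Dispose();
        _stringIndexCache = null;

        _longCache?.Dispose();
        _longCache = null;

        _doubleCache?.Dispose();
        _doubleCache = null;
    }
}
```

Compared with `DisposableAction(Action action)`, this trades generality for one fewer allocation per call; whether that matters depends on how hot this path is.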

grisha-kotler added 8 commits August 4, 2025 21:53

RavenDB-24810 - use unmanaged arrays for arrays larger than 64KB

d79274b

RavenDB-24810 - add Length and AsReadOnlySpan

468052c

RavenDB-24810 - use IArray for order by long

175eac9

RavenDB-24810 - use IArray for order by double

f1e1df0

RavenDB-24810 - add a finalizer for the UnmanagedArray

33c3972

RavenDB-24815 - use the field number array instead of the string array and use the unmanaged array for it

e89b3de

RavenDB-24810 - bump lucene version

3a9236d

RavenDB-24810 - add tests

7f23c7a

grisha-kotler requested review from arekpalinski, ayende and ppekrol August 5, 2025 07:35

ayende requested changes Aug 5, 2025

maciejaszyk approved these changes Aug 5, 2025

RavenDB-24810 - formatting and initialize the fieldNumber

cb838fb

grisha-kotler force-pushed the RavenDB-24810 branch 3 times, most recently from 3e602e2 to c90f8ff Compare August 5, 2025 14:24

grisha-kotler requested review from ayende and maciejaszyk August 5, 2025 14:24

arekpalinski reviewed Aug 6, 2025

grisha-kotler added 10 commits August 6, 2025 14:07

RavenDB-24810 - use an indexer instead of Span and ReadOnlySpan

105c0a3

RavenDB-24810 - skip clearing the array for the Term Cache

f622619

RavenDB-24810 - add a finalizer for the ManagedArray

4f4cb0b

RavenDB-24810 - make the ManagedArray and UnmanagedArray classes sealed

6471dc4

RavenDB-24810 - use a value tuple in order to reduce the number of allocations

4176bb0

RavenDB-24810 - skip materializing the term when not needed

6185a40

RavenDB-24810 - call OnArrayHolderDisposed after the actual disposal

9f86c50

RavenDB-24810 - bump lucene version

95f595c

RavenDB-24810 - allow to disable the usage of the UnmanagedArray and use only the ManagedArray

a061b16

RavenDB-24810 - optimize indexer by moving exception throw to a non-inlined helper method.

a27a51c

RavenDB-24810 - pass _stringIndexCache, _longCache, _doubleCache and dispose them explicitly

00c6627

grisha-kotler force-pushed the RavenDB-24810 branch from c90f8ff to 00c6627 Compare October 21, 2025 08:53

arekpalinski approved these changes Oct 21, 2025

maciejaszyk approved these changes Oct 21, 2025

arekpalinski merged commit f37cf97 into ravendb:ravendb/v5.4 Oct 21, 2025
2 checks passed

grisha-kotler changed the title ~~RavenDB-24810 & RavenDB-24815~~ [5.4] - RavenDB-24810 & RavenDB-24815 Oct 22, 2025

grisha-kotler mentioned this pull request Oct 22, 2025

[6.0] - RavenDB-24810 & RavenDB-24815 - 5.4 to 6.0 #81

Merged

arekpalinski mentioned this pull request Oct 23, 2025

[5.4] - RavenDB-24810 & RavenDB-24815 ravendb/ravendb#21635

Merged

26 tasks

grisha-kotler mentioned this pull request Oct 23, 2025

[6.2] - RavenDB-24810 & RavenDB-24815 ravendb/ravendb#21638

Merged

26 tasks
