Considering DBreeze

Aug 6, 2013 at 4:36 PM
Edited Aug 6, 2013 at 4:39 PM
Hi,

I'm looking for an embedded database for my project which runs on Windows .NET 4.0 and Mono 3.2.

My goal here is to store a tree structure of information.
  • I need to be able to quickly enumerate the tree without loading the entire tree at once.
  • Is having a key-value store table for each level of my tree a viable or proven solution?
  • I could also use an object pool for instantiation if possible.
  • I need to have very fast concurrent queries on two string indexes (basically a key and a secondary key, if I'm not mistaken?). Is opening a transaction for a select operation expensive?
The tree should not contain more than 100,000 nodes. Tell me what you think :)

I've tried object databases but I was not satisfied with the performance results regarding indexing and constant polling on that index.

Regards,

David
Aug 6, 2013 at 10:18 PM
Read the DBreeze docu and give it a try. If you get stuck somewhere, write here for help.
Good luck!
Aug 7, 2013 at 10:03 AM
Hi,

I did read parts of the documentation. I was actually hoping for some pointers on my needs vs DBreeze's typical usage. Is that something you can guide me with?

Regards,

David
Aug 7, 2013 at 10:20 AM
Hmmm...

"I'm looking for an embedded database for my project which runs on Windows .Net 4.0 and Mono 3.2."
  • that's it
    "I need to be able to quickly enumerate the tree without loading the entire tree at once. "
  • this you can, if we talk about the same meaning of "quickly".
    "Is having a key-value store table for each level of my tree an eventual or proven solution? "
  • ?
    "I could also use an objectpool for instanciations if possible. "
  • ?
    "I need to have very fast concurrent queries on two string indexes (basically a key and secondary key if I'm not mistaken?). Is opening a transaction for a select operation expensive?"
  • this you can, if we talk about the same meaning of "fast"
Seems, that you must make tests yourself.
Aug 7, 2013 at 10:30 AM
"The tree should not contain more than 100 000 nodes."
  • It's not a problem.
Aug 7, 2013 at 10:31 AM
Thank you!

By "Is having a key-value store table for each level of my tree an eventual or proven solution? "
I was wondering what would be the way to store a tree structure in a keyvalue store apart from being flat?

By "I could also use an objectpool for instanciations if possible. "
I was asking if I can provide a factory (that would be an object pool) for my objects instanciations. I want memory usage to stay as low as possible during the tree enumeration.

About the concurrent queries, what I'm doing up to now is I'm using an in-memory Dictionary behind a ReaderWriterLockSlim. I need to get as close as this performance wise.
Aug 7, 2013 at 10:48 AM
"Is having a key-value store table for each level of my tree an eventual or proven solution? "
  • being flat or being a trie - always depends upon quantity of records.
"I could also use an objectpool for instanciations if possible. "
  • you don't need to take care about that, DBreeze doesn't use RAM caches and its memory consumption is always on the minimal level.
"About the concurrent queries,"
  • Dictionary of course will be faster, because it's in RAM and DBreeze normally must be used for a huge quantity of elements which are located on the DISK. Thou, DBreeze also can work as in-Memory DB. Speed of the data access you can watch in Benchmark docu.
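If RAM speed is really required, the engine can also be started in memory mode; a minimal sketch (configuration names as I recall them, check the docu):

using DBreeze;

// a sketch: run the whole engine in RAM instead of on disk
var engine = new DBreezeEngine(new DBreezeConfiguration
{
    Storage = DBreezeConfiguration.eStorage.MEMORY
});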
Aug 7, 2013 at 10:50 AM
Edited Aug 7, 2013 at 10:51 AM
"I could also use an objectpool for instanciations if possible. "
  • If you talk about the factory for deserialization of objects - that's out of DBreeze scope.... on the low level Key and Value of DBreeze is byte[]. The rest (objects, columns etc..) is the other level of abstraction.
Aug 7, 2013 at 12:30 PM
"Is having a key-value store table for each level of my tree an eventual or proven solution? "
Ok. The idea is that I was aiming at is having the children of a node at O(1) reach for enumeration.
About it being a trie, I don't understand how to achieve that with DBreeze, can you point me to where I can dig about that or explain your point of view?

"I could also use an objectpool for instanciations if possible. "
Ok so basically if I manage the serialization/deserialization to byte[], it allows me to use an object pool. The idea is to pool my objects during an enumeration, so that I don't have to wait for the GC to collect the enumerated items to keep the memory usage low.

"About the concurrent queries,"
Of course. I meant I just wanna get the closest possible :)
Aug 7, 2013 at 1:41 PM
Edited Aug 7, 2013 at 1:41 PM
"Is having a key-value store table for each level of my tree an eventual or proven solution? "

Please give me more information about the structure you are going to build.
A DBreeze table is itself a search trie, where every key is a leaf.
Aug 7, 2013 at 1:44 PM
Edited Aug 7, 2013 at 1:44 PM
The structure is a tree. Each node has a string identifier that is a path, like in a trie.
Node / is root, with children /a/, /b/, /c/ and so on. They're like file paths.

I will need to access those nodes by their string identifier, or by traversing from the root. I also have a secondary identifier that must reference the same value.
Aug 7, 2013 at 1:54 PM
Edited Aug 7, 2013 at 1:55 PM
Actually, DBreeze is a ready-made search trie:
using(var tran = Engine.GetTransaction())
{
   tran.Insert<string,string>("table1","Hello","val1");
   tran.Insert<string,string>("table1","Hi","val2");
   tran.Insert<string,string>("table1","Bye","val3");
   tran.Commit();
}

using(var tran = Engine.GetTransaction())
{
   foreach(var row in tran.SelectForward<string,string>("table1"))
   {
      Console.WriteLine(row.Key);
   }
}
Aug 7, 2013 at 2:07 PM
Every table is a separate search trie (Liana-Trie; more in the docu).
Aug 7, 2013 at 2:09 PM
Ah, nice. So basically going forward is a full depth-first tree traversal in my case.

But what if I'm not working on the full tree, if I want to traverse from a node other than the root? Can I look up my start node, go Forward from there, and then manually break the traversal using the string's StartsWith method?
Aug 7, 2013 at 2:11 PM
When you type tran., IntelliSense will give you a huge list of methods: From, To, Backward, Forward, Skip, StartsWith, etc. Read the docu.
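For example, with your path-like keys a subtree can be enumerated by key prefix; a quick sketch (table name and prefix assumed):

using (var tran = Engine.GetTransaction())
{
    // enumerate only the subtree under "/a/" by key prefix
    foreach (var row in tran.SelectForwardStartsWith<string, string>("table1", "/a/"))
    {
        Console.WriteLine(row.Key);
    }
}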
Aug 7, 2013 at 2:16 PM
Alright sounds good.

Thank you very much for this conversation.
Aug 7, 2013 at 2:33 PM
You're welcome!
Aug 8, 2013 at 5:37 PM
Maybe he could also make some use of NestedTables.
Aug 9, 2013 at 4:47 PM
So I've set up DBreeze as the persistence layer for a part of my application. It's damn easy and it just works. Cheers!

Now I'm looking at performance. The following stopwatch records ~2300ms to go through this table, which contains 70k elements. I'd like to get this to the fastest result possible.
var stopwatch = new Stopwatch();
stopwatch.Start();
foreach (var item in transaction.SelectForward<string, byte[]>(SynchronizedDataTableName))
{
    if (item.Exists)
    {
        // pause the stopwatch so deserialization time is not measured
        stopwatch.Stop();
        var theObject = ProtobufSerializationHelper.Deserialize<DataBaseFileEntry>(item.Value);
        stopwatch.Start();
        yield return theObject;
    }
    else
    {
        throw new Exception();
    }
}
stopwatch.Stop();
I could split this into multiple transactions that start reading from different positions using SelectForwardStartFrom and execute them in parallel. Or maybe I can split this mathematically?

Any idea how I could make this more performant using DBreeze itself? Like I said, this is a tree structure. I don't mind changing the way I store the tree in exchange for performance. For now I simply do this:
// master record: key = ServerIndex, value = serialized entry; ptr receives a direct pointer to the stored value
transaction.Insert(SynchronizedDataTableName, dataBaseFileEntry.ServerIndex, ProtobufSerializationHelper.Serialize(dataBaseFileEntry), out ptr);
// secondary index: key = ClientIndex, value = the pointer to the master record
transaction.Insert(SynchronizedDataClientIndexTableName, dataBaseFileEntry.ClientIndex, ptr);
David
Aug 9, 2013 at 9:51 PM
Edited Aug 9, 2013 at 10:10 PM
It's hard to say what you can improve in
foreach (var item in transaction.SelectForward<string, byte[]>(SynchronizedDataTableName))
using DBreeze itself.

Keys themselves can influence the iteration speed. LianaTrie is a variation of a radix trie, so if you have many grapes on one branch, the total gathering time will be smaller. If your keys are quite sparse, DBreeze will need more time to traverse all of them.

In other words, keys 1,2,3,4,5 will iterate faster than keys 1234, 87987, 654687897, etc.


BTW.
Checking if (item.Exists) is not necessary during iteration.
Aug 9, 2013 at 9:52 PM
Parallel reads are possible and will boost the total performance.
Aug 9, 2013 at 10:07 PM
Do you have an idea for a different way of expressing my tree in DBreeze that would allow better traversal performance?
Aug 9, 2013 at 10:16 PM
Please give more information about your tree.
Aug 9, 2013 at 10:20 PM
Also, describe your desired traversal-speed expectations.
Also note that:

Keys themselves can influence the iteration speed. LianaTrie is a variation of a radix trie, so if you have many grapes on one branch, the total gathering time will be smaller. If your keys are quite sparse, DBreeze will need more time to traverse all of them.

In other words, keys 1,2,3,4,5 will iterate faster than keys 1234, 87987, 654687897, etc.
Aug 9, 2013 at 10:24 PM
Also, you can check the STSdb database, which is a B+Tree. You won't have StartsWith there, and key-search time will increase with the growing quantity of records, but traversing can be a bit faster in some cases. Just try it and find your fit.
Aug 9, 2013 at 10:55 PM
I'd love to be under 500ms.

OK. I'll read more about how the trie works; I don't understand why keys matter if it's always ordered.

What if, for my code snippet, I didn't need to traverse and only needed to enumerate the content of the tree in any order, as a list? Then keys wouldn't matter, right? Sometimes I need a traversal; sometimes I can be satisfied by enumerating in any order.

I don't know what extra info about the tree I could give you. Keys are strings representing a path, e.g. / is root and has children /a/ and /b/, /a/ has child /a/1/, etc.
Aug 10, 2013 at 10:11 AM
First of all, I must say that 500ms is not so far away from 2300ms, and this problem can partly be solved by changing hardware.

I have made some tests with a "middle dispersion" index:

Inserting 100K records of 200 bytes each:
using (var tran = engine.GetTransaction())
{
    DateTime now = DateTime.Now;
    DBreeze.Diagnostic.SpeedStatistic.StartCounter("a");
    for (int i = 0; i < 100000; i++)
    {
        now = now.AddDays(2);
        tran.Insert<DateTime, byte[]>("A", now, new byte[200]);
    }

    tran.Commit();

    DBreeze.Diagnostic.SpeedStatistic.PrintOut("a", true);
}
The total table size is 22MB; insert speed is 1473 ms in Debug mode of DBreeze. The benchmark PC is the one from the Benchmark docu.

OK, now I am going to traverse this table.
using (var tran = engine.GetTransaction())
{
    DBreeze.Diagnostic.SpeedStatistic.StartCounter("a");

    int i = 0;

    foreach (var row in tran.SelectForward<DateTime, byte[]>("A"))
    {
        i++;
    }

    DBreeze.Diagnostic.SpeedStatistic.PrintOut("a", true);
    Console.WriteLine("Cnt: {0}", i);
}
Iteration without acquiring the value took 480ms.

Now with acquiring the value:
using (var tran = engine.GetTransaction())
{
    DBreeze.Diagnostic.SpeedStatistic.StartCounter("a");

    int i = 0;
    byte[] val = null;

    foreach (var row in tran.SelectForward<DateTime, byte[]>("A"))
    {
        i++;
        val = row.Value;
    }

    DBreeze.Diagnostic.SpeedStatistic.PrintOut("a", true);
    Console.WriteLine("Cnt: {0}", i);
}
Iteration with acquiring the value took 1171 ms.

Now I will insert the same into the table, but the value size will be 100 bytes instead of 200 bytes:
  • insert time is 1107 ms
  • iteration with value acquiring is 1165 ms
What I can think about is making an overload for Select that gets the value together with the key; it may save up to 2x time (if the value is not more than 200-300 bytes; it depends on the chunk size we read from the HDD, and if key+value is bigger than the chunk we must make 2 HDD reads and can't save anything).
For now, any Select reads only the key and a direct pointer to the value. Only when we call row.Value is the value read out via that pointer.
Aug 10, 2013 at 11:40 AM
Edited Aug 10, 2013 at 11:40 AM
I'm working on an application that is distributed to clients. I can't change the hardware :) I'm aiming for the best performance I can get.

Directly selecting the value (or even not returning the key at all) instead of lazy loading it is a good idea for the enumerable.

On Monday I'll run your bench above. I'll also try one with string keys that look like mine and share the results. If my keys are problematic, instead of managing a table and a secondary index, I could manage one table with data and high-performance keys for enumeration, plus two secondary index tables pointing at the main table's data pointers. I'll give that a try on Monday too.
Aug 11, 2013 at 12:10 PM
For the tests, take the new DBreeze (57).
Aug 12, 2013 at 10:09 AM
Results of the above benches:
  • insert: 1200 ms
  • select without getting the value: 750 ms
  • select with getting the value: 1500 ms

I changed my solution to use int keys instead of strings, and managed the indexes in other tables. Inserting my 70k objects takes more time, but enumerating is faster. I get 1400ms with getting the value and deserializing it, 730ms without getting/deserializing the value. This is great! I'm going to fragment my process functionally, parallelize some stuff, and get back with some more results.

When do you think you could add the new overload that gets the value straight away? Could we have the same kind of overloads for Select and SelectDirect?
Aug 12, 2013 at 10:44 AM
The main behavior I would like to leave as it is. Sometimes, when the value is small (which is the case for practically all secondary indices), we can add (concatenate) the value to the end of the key; we then don't need to fetch the value at all. Sometimes lazy loading is good. The same idea applies when I think about SelectPart and quite big values.
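A sketch of what I mean (table names and the hash value are assumed): the 8-byte pointer returned by the master-table insert is appended to the secondary-index key, so reading the index never touches a value:

using DBreeze.Utils;

using (var tran = engine.GetTransaction())
{
    byte[] ptr;
    // master record; ptr receives the direct pointer to the stored value
    tran.Insert<int, byte[]>("Master", 1, new byte[200], out ptr);

    // secondary index: 8-byte hash of the original key + 8-byte pointer,
    // all packed into the key; the row value is a dummy byte
    long hash = 12345;
    byte[] indexKey = hash.To_8_bytes_array_BigEndian().Concat(ptr);
    tran.Insert<byte[], byte>("Index", indexKey, 1);

    tran.Commit();
}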

About the overloads I will have to think; it can take time.
Aug 12, 2013 at 10:57 AM
Edited Aug 12, 2013 at 10:57 AM
I don't get how you concatenate the value onto the key and then how you retrieve it.

I think lazy loading is good, but it should be an option for the Select* methods. When one wants to enumerate values and the keys don't even matter, lazy loading is not really useful. But in my case I will use SelectForward anyway, to split my data into different reads.

So, I just realized I can't split my data table functionally, since I don't have string keys anymore. I think I will manage a hashset of available int keys released by deletion, next to my "next id" value. That way I could have "pointers" telling me how to split my enumeration. Would you have a better idea? That said, it means I do need to analyze keys to know how I can traverse my data with SelectForwardFromTo with the best performance. An awesome feature would be a way to StartForwardFromIndexToIndex. I don't know if it's possible to identify a node by its position in a SelectForward, for example?
Aug 12, 2013 at 1:05 PM
Please read the documentation about DBreeze's Liana Trie (which is a variation of a radix trie).
Also read about secondary indexes and compound indexes in the documentation.

Then you will understand how keys are stored, and also that StartForwardFromIndexToIndex is the same as SelectForwardFromTo.
Aug 12, 2013 at 1:31 PM
OK, I'll read that.

By StartForwardFromIndexToIndex I meant traversing position-wise indexes: as if, natively with DBreeze, I could select a fixed range of items, say the 50th through the 100th. That way I could split my read operations into table.Count / 10 chunks and that's it. If that's what you understood, my reading will help :)
Aug 12, 2013 at 1:43 PM
You have the Skip and SkipFrom options.
Aug 12, 2013 at 2:27 PM
I thought I'd get too much overhead from going through keys many times (in my case, the worker-thread count). The docu says I won't. I can't really use SkipFrom, since I don't know the actual key to start from. Since I have to manage keys for my values, I can make sure to keep a series of integers and split the work according to the highest key in the database, then use SelectForwardFromTo. I think that's the best I can get, no?
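Concretely, a sketch of what I have in mind (table name, engine variable and the max-key value are assumed):

using System;
using System.Threading.Tasks;

int workers = 4;
int maxKey = 70000;                  // highest int key currently in the table
int chunk = maxKey / workers + 1;

Parallel.For(0, workers, w =>
{
    // each worker reads its own contiguous key range in its own transaction
    using (var tran = engine.GetTransaction())
    {
        int from = w * chunk;
        int to = Math.Min(maxKey, (w + 1) * chunk - 1);
        foreach (var row in tran.SelectForwardFromTo<int, byte[]>("Data", from, true, to, true))
        {
            // process row.Value ...
        }
    }
});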

I've read about secondary indexes already. The compound key is a good idea. I'll give it a try.
Aug 12, 2013 at 2:39 PM
It's very difficult for me to give you advice, because I don't know your problem.
Aug 12, 2013 at 4:06 PM
Edited Aug 12, 2013 at 4:08 PM
I have published the new release, DBreeze 59.
There you can set tran.ValuesLazyLoadingIsOn = false; and Select, SelectDirect and all traversal ops will be done without lazy loading. ValuesLazyLoadingIsOn can be switched as many times as necessary inside one transaction. The docu is enhanced with the explanation.
byte[] val = null;
using (var tran = engine.GetTransaction())
{
    for (int i = 0; i < 100000; i++)
    {
        tran.Insert<int, byte[]>("A", i, new byte[2000]);
    }
    tran.Commit();
}


using (var tran = engine.GetTransaction())
{
    tran.ValuesLazyLoadingIsOn = false;

    DBreeze.Diagnostic.SpeedStatistic.StartCounter("a");
    foreach (var row in tran.SelectForward<int, byte[]>("A"))
    {
        val = row.Value;
        //Console.WriteLine("K: {0} - {1}", row.Key, row.Value);
    }
    DBreeze.Diagnostic.SpeedStatistic.PrintOut("a", true);
}
Good luck in testing!
Aug 12, 2013 at 4:22 PM
About compound keys, in order to avoid lazy loading the value and to get one HDD hit: I have 256-char strings as secondary indexes. I tried to insert my 70k items with To_FixedSizeColumn(256, true).Concat(ptr), and after 5 minutes I stopped waiting; it took 10 seconds beforehand. I then used MixedMurMurHash3_64, which went fine. I don't ask for the index value anymore and I won't store the pointer in the value anymore. The only way I can select a key now is using StartsWith, right?

About splitting my full-tree value enumeration, which I'd want to spread over various threads: I need to split the tree into N parts to read from. One option is to do so without key "intelligence", meaning I could simply skip entries, accepting the overhead. The other is to take advantage of the keys. For the performance issues mentioned above, I used an integer key, since my strings were much too slow. I can keep a managed pool of integers on which I could rely to have start and finish boundaries for SelectForwardFromTo. Do you think I could do better?
Aug 12, 2013 at 4:27 PM
"The only way I can select a key now is using StartsWith, right?"
"Do you think I could do better?"
  • I have no idea, because I don't know what you are putting there, why, or what you are trying to achieve.
Aug 12, 2013 at 4:40 PM
Insert-time problems?
Read about random key inserts in the docu: keys must be sorted ascending in memory before inserting.
Read about Technical_SetTable_OverwriteIsNotAllowed.
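A sketch of such a bulk insert (the 'pairs' dictionary and the GetEntries helper are assumed stand-ins for your 70k hash-keyed entries):

using System.Linq;
using System.Collections.Generic;

// 'pairs' stands for the 70k entries, keyed by the 64-bit hash (assumed)
Dictionary<long, byte[]> pairs = GetEntries();

using (var tran = engine.GetTransaction())
{
    // hint to DBreeze that all inserted keys are new: faster insert path
    tran.Technical_SetTable_OverwriteIsNotAllowed("Index");

    // sort ascending in memory before inserting
    foreach (var kv in pairs.OrderBy(p => p.Key))
        tran.Insert<long, byte[]>("Index", kv.Key, kv.Value);

    tran.Commit();
}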
Aug 12, 2013 at 4:48 PM
I don't understand what more you want to know...

"The only way I can select a key now is using StartsWith, right?"
  • I have a compound key. Can compound keys only be found using Select*StartsWith?
"Do you think I could do better?"
  • Let me put it this way:
    ** How would you read a table from multiple threads if you don't have information about the keys?
    ** If you don't care about the key type or key content of a table, what kind of keys would you use to allow an easy way to split the table content using keys in SelectForwardFromTo?
ValuesLazyLoadingIsOn = false
  • Nice! The enumeration gained 27% speed.
Aug 12, 2013 at 4:54 PM
Edited Aug 12, 2013 at 4:56 PM
Thanks for the insert-time hint; it's my indexes that take the time, though, and I can't do much about them. The speed is fine using the hashes.
Aug 12, 2013 at 5:25 PM
Edited Aug 12, 2013 at 9:02 PM
Any key in a DBreeze table is a byte[] of variable length.
So you can have:
tran.Insert<byte[],int>("A", new byte[] {1,2,3},1)
tran.Insert<byte[],int>("A", new byte[] {1,2,3,4},1)
tran.Insert<byte[],int>("A", new byte[] {1,2,3,4,5},1)
tran.Insert<byte[],int>("A", new byte[] {1,2,3,5,5},1)
... if we sort them ascending we will get

1,2,3
1,2,3,4
1,2,3,4,5
1,2,3,5,5

When we insert the key as integer 1,
tran.Insert<int,int>("A", 1,1)
on the low level it will be converted to 128,0,0,1;
when 2 -> 128,0,0,2.
tran.Insert<int,int>("A", 1,1)
tran.Insert<int,int>("A", 2,1)
tran.Insert<int,int>("A", 3,1)
tran.Insert<int,int>("A", 4,1)

Then, reading these keys as byte arrays sorted ascending, we will see:
128,0,0,1
128,0,0,2
128,0,0,3
128,0,0,4

If we insert string keys
tran.Insert<string,int>("A", "Hi",1)
tran.Insert<string,int>("A", "Hi1",1)
tran.Insert<string,int>("A", "Hi123",1)

they will look like this (approximate digits):
15,1
15,1,20
15,1,20,21,22

When I say SelectForwardStartsWith("Hi1"), the return is:
Hi1
Hi123

In the case of integers, when I say SelectForwardStartFrom<byte[]>(new byte[] {128,0,0,3}), the return is:
128,0,0,3
128,0,0,4
and when I say SelectForwardStartFrom<int>(3), the return is:
128,0,0,3
128,0,0,4

These are the basic techniques for the radix trie.
If you have an event which happens every 5 seconds and you want to put it into the database as a primary key, you can choose DateTime; it will also be converted into a byte[]. Try tran.Insert<DateTime,int>("A", DateTime.Now, 1) and select it using byte[] (var row in tran.SelectForward<byte[],int>), with Console.Write(row.Key.ToByteString()); 'using DBreeze.Utils;' gives the byte[]-to-datatype conversions.

All keys will be byte[]; you can see them and apply the radix-trie operations ForwardFrom, To, StartsWith, etc. Make tests to understand how the radix trie works.

To a key of any type you can apply StartsWith; how you supply the params will depend on your understanding of the topic.
I can help when I know the business logic beneath.

About your second topic: maybe you can insert those keys into 4 different tables t1, t2, t3, t4 in a round-robin way, then run 4 parallel threads to read from the different tables?
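A sketch of the round-robin idea (table names and counts assumed):

string[] tables = { "t1", "t2", "t3", "t4" };

using (var tran = engine.GetTransaction())
{
    // reserve all four tables for writing inside one transaction
    tran.SynchronizeTables(tables);

    for (int i = 0; i < 100000; i++)
        tran.Insert<int, byte[]>(tables[i % 4], i / 4, new byte[200]);

    tran.Commit();
}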
Aug 12, 2013 at 5:35 PM
Edited Aug 12, 2013 at 5:57 PM
Compound index.
We want to combine a DateTime and an integer. The DateTime will be represented with 8 bytes and the integer with 4 bytes,
12 bytes in total.
I am going to insert as the key the event time + an identity, to make records unique in case 2 events take place at the same moment:

using DBreeze.Utils;

int identity = 0;
using (var tran = engine.GetTransaction())
{
    byte[] key = null;
    DateTime now = DateTime.Now;

    for (int i = 0; i < 100; i++)
    {
        now = now.AddSeconds(7);
        identity++;
        // 8-byte time + 4-byte identity = 12-byte compound key
        key = now.To_8_bytes_array().Concat(identity.To_4_bytes_array_BigEndian());
        tran.Insert<byte[], string>("t1", key, "event");
    }

    tran.Commit();
}
Now, if I want to see which events happened between 01.02.2013 12:00 and 01.03.2013 13:00,
I will have to make something like this:
using (var tran = engine.GetTransaction())
{
    DateTime start = new DateTime(2013, 02, 01, 12, 00, 00);
    DateTime stop = new DateTime(2013, 03, 01, 13, 00, 00);

    // Very important: From and To must have the same 12-byte length
    byte[] StartKey = start.To_8_bytes_array().Concat(new byte[] { 0, 0, 0, 0 });
    byte[] StopKey = stop.To_8_bytes_array().Concat(new byte[] { 255, 255, 255, 255 });

    foreach (var row in tran.SelectForwardFromTo<byte[], string>("t1", StartKey, true, StopKey, true))
    {

    }
}
Aug 12, 2013 at 5:55 PM
Now I want to create another compound index: in one table I want to store settings for different users.
It's a secondary index and will have a direct pointer to the master table.
The key will have this structure: userId (long, 8 bytes) + event start time (8 bytes) + pointer to the master table for SelectDirect (8 bytes).
using (var tran = engine.GetTransaction())
{
    byte[] key = null;
    long UserId = 8;
    DateTime startDT = DateTime.Now;
    byte[] ptr = new byte[8];   // the direct pointer returned by the master-table insert

    key = UserId.To_8_bytes_array_BigEndian().ConcatMany(startDT.To_8_bytes_array(), ptr);
    tran.Insert<byte[], byte>("A", key, 1);

    tran.Commit();
}
Getting all events for that user can then be done like this:
using (var tran = engine.GetTransaction())
{
    long UserId = 8;

    // the key is 24 bytes (8 + 8 + 8), so From/To must be padded to the same 24-byte length
    byte[] StartKey = UserId.To_8_bytes_array_BigEndian().Concat(new byte[16]);
    byte[] StopKey = UserId.To_8_bytes_array_BigEndian().Concat(new byte[] { 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255 });

    foreach (var row in tran.SelectForwardFromTo<byte[], byte>("A", StartKey, true, StopKey, true))
    {

    }
}
or
using (var tran = engine.GetTransaction())
{
    long UserId = 8;

    byte[] StartKey = UserId.To_8_bytes_array_BigEndian();

    foreach (var row in tran.SelectForwardStartsWith<byte[], byte>("A", StartKey))
    {

    }
}
Aug 12, 2013 at 5:55 PM
I understand the concepts you mentioned; I've actually read the documentation + Wikipedia.

My question about the compound key is related to the context I talked about above. I used to have a string key for the secondary index and a pointer as the value. I changed the key to an 8-byte hash converted to a byte array, and concatenated to it the pointer to the real record of my main table, instead of putting the pointer in the value; you suggested that in order to avoid two HDD hits. Now, when I want to find the key without knowing the actual pointer, I use SelectForwardStartsWith().FirstOrDefault() and then decompose the key to get my data. What I'm asking is whether there is another way to do so, other methods in DBreeze I don't know about, or something?

About the parallel reads: I chose int to keep my main table performant for enumeration; I use secondary indexes to look up data. I'm asking if it sounds like a good idea to manage the int keys so that I can split my reads using SelectForwardFromTo. When I say manage, I mean I would keep the ints close to a 1, 2, 3, ..., n series so that I can split my reads using the keys.

I did think about splitting the data into various tables. If I do so, I need to manage indexes that point to table + pointer. I'm still wondering if the option above is easier or cleaner, because if I do use various tables, I still have to generate unique keys.
Aug 12, 2013 at 6:38 PM
"Now, when I want to find the key without knowing the actual pointer, I use SelectForwardStartsWith().FirstOrDefault() and then decompose the key to get my data."
I think in this case it's a correct technique.
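A sketch of that lookup (table names, and the 8-byte-hash + 8-byte-pointer key layout, are assumed):

using System;
using System.Linq;
using DBreeze.Utils;

using (var tran = engine.GetTransaction())
{
    long hash = 12345;                                  // the 64-bit hash of the string path (assumed)
    byte[] prefix = hash.To_8_bytes_array_BigEndian();

    var row = tran.SelectForwardStartsWith<byte[], byte>("Index", prefix).FirstOrDefault();
    if (row != null && row.Exists)
    {
        // the last 8 bytes of the key are the direct pointer to the master record
        byte[] ptr = new byte[8];
        Array.Copy(row.Key, row.Key.Length - 8, ptr, 0, 8);

        var master = tran.SelectDirect<int, byte[]>("Data", ptr);
    }
}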

About the rest: to give you advice I must know the nature of your keys. You always talk about some kind of keys which you want to enumerate on one side, while on the other side they are some kind of hash codes.
I don't know how many keys must be inserted, how often, or why you want to read them in parallel. Do you delete them or not?

Maybe these keys must be identities; maybe they must be deleted; maybe you can get min, max and count to get the active identity range; maybe you can insert them already as buckets?

Who knows all this?
Aug 13, 2013 at 9:41 AM
Edited Aug 13, 2013 at 9:42 AM
Well, I have tried to explain the nature of my keys many times. I guess we just don't get each other.

"You always talk about some kind of keys which you want to enumerate"
  • I actually never said that. I want to enumerate the values only.
At first I said:

dlamb3rt wrote:
The structure is a tree. Each node has a string identifier that is a path, like in a trie.
Node / is root, with children /a/, /b/, /c/ and so on. They're like file paths.

I will need to access those nodes by their string identifier, or by traversing from the root. I also have a secondary identifier that must reference the same value.
So I made a table with a string key, showed you my insert snippet, and shared my performance result for 70k nodes: 2300ms.

Then you told me: "If your keys are quite sparse, DBreeze will need more time to traverse all of them."

So, because I have long paths and performant enumeration of my table values is important, I thought I needed to change the key type. I based my decision on the example you gave me with the integers, because I guess there's no simpler way: "I changed my solution to use int keys instead of strings, and managed the indexes in other tables to represent my string keys. Inserting my 70k objects takes more time, but enumerating is faster." Please keep in mind that I don't care what the main table's key type is; I just want fast enumeration. My secondary indexes take care of lookup.

The reason I want to read in parallel is: "I'm aiming for the best performance I can get."

I talked about hash codes, yes. I said: "About compound keys, in order to avoid lazy loading the value and to get one HDD hit: I have 256-char strings as secondary indexes. I tried to insert my 70k items with To_FixedSizeColumn(256, true).Concat(ptr), and after 5 minutes I stopped waiting; it took 10 seconds beforehand. I then used MixedMurMurHash3_64, which went fine." Basically my secondary indexes, which were the string paths, are now hashes instead of strings, for performance. They are byte[] compound keys carrying a direct pointer to the main-table record.

The 3rd and last paragraph of my last post explains the two options I have in mind. One is to manage a tight int pool so that I can easily split my table into ranges; using Min and Max is not so good, because the results won't be very fair with respect to gaps between records, and the other problem with this solution is that massive deletions would leave huge gaps. It's not a blocker, but I wish I could avoid it.
The other is to manage buckets. If I do so, what do you think I should use for the main-table keys? I won't really need the keys, so it could be anything. Is an auto-increment int key OK?

I must enumerate the values every short period of time, with parallel reads or not: roughly every 2 minutes, but also at random client-triggered times. I want this process to be very reactive. Entries are inserted, updated and deleted without notice. The tree represents files and I cannot predict the behavior of the client. Assume that an entire tree can be deleted, inserted, etc. Typically it won't happen, but that's the worst-case scenario.
Aug 13, 2013 at 9:45 AM
About the table design, I did mention my two current options, but I'm wondering if there could be other ways.
Aug 13, 2013 at 9:59 AM
The fastest enumeration will be with identities 1,2,3,4,5, etc. But don't mislead yourself: it's not so much faster than sparse keys... maybe you can gain 20-30%.
Aug 13, 2013 at 10:03 AM
dlamb3rt wrote:
I changed my solution to use int keys instead of strings, and managed the indexes in other tables. Inserting my 70k objects takes more time, but enumerating is faster. I get 1400ms with getting the value and deserializing it, 730ms without getting/deserializing the value. This is great! I'm going to fragment my process functionally, parallelize some stuff, and get back with some more results.
I gladly got 39% less overhead from 2300 to 1400!
Aug 13, 2013 at 10:06 AM
dlamb3rt wrote:
ValuesLazyLoadingIsOn = false
  • Nice! The enumeration gained 27% speed.
It's about 1100ms, which is actually a 21% gain from 1400.
Aug 13, 2013 at 12:35 PM
Excellent!
Aug 13, 2013 at 1:13 PM
Would you have any advice on using managed ints vs buckets vs another suggestion?
Aug 13, 2013 at 1:17 PM
For reading in parallel.
Aug 13, 2013 at 1:34 PM
By "bucket" I mean the same kind of container as in hash tables.
If you are able to collect your keys into buckets, you can read by buckets and iterate by buckets; these can be in-memory operations and can grant huge operational speed with a not very big RAM overhead. I think of DBreeze's Liana Trie as a universal search trie; optimization can come from storing indexes in buckets.
Unfortunately it's not always possible; it depends upon the nature of the index.

For example, values 1...9999 = bucket0; 10000...19999 = bucket1; 20000...29999 = bucket2.

Take a value and divide it by 10000 to find out in which bucket it is located; store and retrieve values always by bucket.
In DBreeze you will then have only the keys 0,1,2,3, which represent the buckets. The value of each such row will be a Dictionary of 10000 items.
That's what I mean by buckets, and that's why it's not possible to find such a "magic hashcode" for creating buckets for all kinds of keys.
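A sketch of a bucket read (table name assumed; the ProtobufSerializationHelper is the one from your snippets above):

using System.Collections.Generic;

// bucket n stores a Dictionary of 10000 entries under the single key n
int id = 23456;
int bucket = id / 10000;   // -> bucket 2

using (var tran = engine.GetTransaction())
{
    var row = tran.Select<int, byte[]>("Buckets", bucket);
    if (row.Exists)
    {
        var dict = ProtobufSerializationHelper.Deserialize<Dictionary<int, DataBaseFileEntry>>(row.Value);
        var entry = dict[id];   // one disk hit served 10000 items
    }
}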
Aug 13, 2013 at 2:08 PM
I could have n buckets with round-robin insertion, like you mentioned earlier in the discussion. In each one I would have auto-incrementing identifiers 1,2,3,...,i (without recovering deleted identifiers). For enumeration, I would launch n reads and that's it. My index tables would point to a compound of the table name and a direct pointer.

If I managed n buckets with n indexes like you just suggested, I would still have to manage my identifiers so that deleted ones get recovered, in order to make sure my buckets stay as full as possible. One thing, though: I cannot put the indexes in buckets, because I couldn't know in which bucket to do the lookup.

I'm leaving out the idea of managing integer keys to know how to split one single table. I'll go for buckets for sure. I'll try both approaches and share the results of the different overheads.

By the way, I wanted to say thank you. This is a great product! The only thing I think I would enjoy, especially in my case, is a simple table without keys, so not in a tree; enumerating the content would not need any traversal, which should be much faster. I'm pretty sure I know what you're going to answer, but as we can see in my case, I store my data and then I store the pointers myself, using indexes, in whatever design.
Aug 13, 2013 at 3:14 PM
You're welcome, good luck!
Aug 13, 2013 at 6:54 PM
A new release has been published, fixing some issues after the last changes. As I see, you are not subscribed to the project, so don't miss the fix.
Aug 14, 2013 at 1:24 PM
Got it, thanks.

Yesterday, before I started testing parallel reads, I altered my usage of the SelectForward method by buffering 200 items. It went down from 1100ms to 800ms.

I then tried the method I had left out, splitting reads using the integer keys, since it was almost ready for testing. Splitting into two threads, I got about 630ms; four threads is the same. I noticed that every thread, whether I use only one thread to read the whole table or split into two or four threads, takes practically the same time. One thread reads the table in 500, two threads read the whole table together in 500, etc. I don't understand why the resulting times won't go down?
Aug 14, 2013 at 1:35 PM
Let me rephrase. Under normal conditions, if reading an entire table takes 500, and I start 4 threads to read 4 different portions of the table, the 500 result should get smaller, right? Maybe not divided by four, but at least a 50% gain?
Aug 14, 2013 at 1:53 PM
It depends on your HDD speed; it has only one head to serve all your threads. For example, I have an 8-core CPU and a fast HDD. When I insert into one table I get 500K records per second; when I use 6 threads to write into 6 different tables located on the same HDD, I achieve 1.7 million records per second, which gives a maximum of 40MB/sec of HDD writes. Adding threads will not help me, but if I put the tables on 6 different HDDs, I would achieve 3 million inserts per second with 6 threads.
Aug 14, 2013 at 1:56 PM
Well, I was expecting something similar to your results using one HDD. And I don't read much data.

I'm splitting into buckets right now; I'll share the results.
Aug 14, 2013 at 3:02 PM
Using 10000-item buckets, I get a result of 450ms reading from 8 tables. 20000-item buckets give the same result. It's good!

Now I'll check whether my insert speeds are affected; if not, I don't think I could get any faster unless you provide a flat table structure to store data :D
Aug 14, 2013 at 3:15 PM
Good!

Flat table... what is that? List<object>? If you don't need a very big table, take a List, serialize it and use InsertDataBlock.
Aug 14, 2013 at 3:17 PM
Edited Aug 14, 2013 at 3:21 PM
dlamb3rt wrote:
The only thing I think I would enjoy, especially in my case, is a simple table without keys, so not in a tree; enumerating the content would not need any traversal, which should be much faster. I'm pretty sure I know what you're going to answer, but as we can see in my case, I store my data and then I store the pointers myself, using indexes, in whatever design.
Yeah, that's what I meant there, like a list, but working like a table without a tree structure and keys, where the table's entries are accessed by pointer only.
Aug 14, 2013 at 3:33 PM
You can make it yourself.
File 1 - the index.
File 2 - the data.

File 1:
an 8-byte pointer to the data plus an 8-byte data length; these doublets come one after another. If we need record 5, we compute 5 * 16 = 80 and,
starting from byte 80, read 16 bytes to get the file-2 pointer and the data length.

File 2: just the data.

It's also possible to add a deleted flag to the first file.
Such a structure can be accessed by index (0-n) or traversed from index A to index B, forward or backward.
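A minimal sketch of that two-file layout (file names assumed):

using System;
using System.IO;

// index.bin: 16 bytes per record (8-byte data offset + 8-byte length)
// data.bin : the raw record bytes
static byte[] ReadRecord(long i)
{
    using (var idx = File.OpenRead("index.bin"))
    using (var dat = File.OpenRead("data.bin"))
    {
        idx.Seek(i * 16, SeekOrigin.Begin);
        var entry = new byte[16];
        idx.Read(entry, 0, 16);

        long offset = BitConverter.ToInt64(entry, 0);
        int length = (int)BitConverter.ToInt64(entry, 8);

        dat.Seek(offset, SeekOrigin.Begin);
        var record = new byte[length];
        dat.Read(record, 0, length);   // single Read call kept simple for the sketch
        return record;
    }
}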

Maybe it can sometimes be useful for someone.
But DBreeze repeats this functionality using buckets and can have competitive speed (having all these transactions and other stuff inside).
Aug 14, 2013 at 3:38 PM
hhblaze wrote:
But DBreeze repeats this functionality using buckets and can have competitive speed (having all these transactions and other stuff inside).
What do you mean? Traversal speed depends on the index type and dispersal, no? If there is no such factor, it should be faster in any case.
Aug 14, 2013 at 3:53 PM
Edited Aug 14, 2013 at 3:53 PM
If the table is flat, the index MUST be 0,1,2,3,4,5,6..., otherwise you don't know how to get elements by index.
For DBreeze such an index 0,1,2,3,4,5,6... also has the best possible distribution, but DBreeze can work with different index types. Buckets will give an even bigger traversal-speed gain, but not a better random-select speed.

For me the speed difference between DBreeze and such a flat table is insignificant. If for you it is so critical, you know what to do :)
Aug 14, 2013 at 4:51 PM
Alright, thanks for the advice. I probably don't realize how slight the performance difference between the two is.

Thanks a lot for the help.