SelectForward exception.

Apr 9, 2013 at 1:48 PM
Edited Apr 9, 2013 at 1:49 PM
Hi.
System.IndexOutOfRangeException: Index was outside the bounds of the array.
   at DBreeze.Utils.BytesProcessing.To_UInt64_BigEndian(Byte[] value)
   at DBreeze.DataTypes.DataTypesConvertor.ConvertBack[TData](Byte[] dt)
   at DBreeze.DataTypes.Row`2..ctor(LTrieRow row, LTrie masterTrie, Boolean useCache)
   at DBreeze.Transactions.Transaction.<SelectForward>d__23`2.MoveNext()
This started happening on:

-> foreach (Row<ulong, byte[]> row in transaction.SelectForward<ulong, byte[]> ("recordsTableName"))
{
..
}

Any ideas? I do, however, suspect the machine was forcibly restarted, but I have to confirm it.
Coordinator
Apr 9, 2013 at 2:08 PM
Previous versions (<47) could corrupt the structure, and you might not notice it until a specific region is selected.
If you started this table with ver. 47, then we have to try to reproduce this issue (by emulating the same save strategy as in your procedure) - there is no other way to find the reason.
If the table evolved from a previous DBreeze version, just try to repack it (read forward up to the error, then backward up to the error - something will be lost) and keep observing. A sketch of such a repack follows below.
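
A minimal repack sketch, assuming the table's key/value types are ulong/byte[] as in your loop, that engine is your DBreezeEngine instance, and that SelectBackward is available alongside SelectForward (the target table name is illustrative):

using DBreeze;
using DBreeze.DataTypes;

using (var tran = engine.GetTransaction())
{
    try
    {
        // Forward pass: copy rows until the corrupted region throws.
        foreach (Row<ulong, byte[]> row in tran.SelectForward<ulong, byte[]>("recordsTableName"))
            tran.Insert<ulong, byte[]>("recordsTableName_repacked", row.Key, row.Value);
    }
    catch { /* hit the corrupted region; keep what was copied so far */ }

    try
    {
        // Backward pass: recover rows lying beyond the corrupted region.
        foreach (Row<ulong, byte[]> row in tran.SelectBackward<ulong, byte[]>("recordsTableName"))
            tran.Insert<ulong, byte[]>("recordsTableName_repacked", row.Key, row.Value);
    }
    catch { /* stopped at the corrupted region again; rows inside it are lost */ }

    tran.Commit();
}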
Apr 9, 2013 at 2:19 PM
The version I used was 01.046.20130304. So has something changed since we last spoke?
Coordinator
Apr 9, 2013 at 2:24 PM
Oh, yes - that's why I insist on subscribing to the project... Version 47 is released; read the "changes in version" document.
Apr 9, 2013 at 7:29 PM
Edited Apr 9, 2013 at 8:44 PM
I've switched to the new library; however, I am getting some odd store times:

records=4939, Time=79744.6ms

Code I use:
for (i = 0; i < rawRecords.Length; ++i)
   {
    ++id;
    transaction.Insert<byte[], ulong> ("guids", records[i].Guid.ToByteArray (), id);
    transaction.Insert<ulong, byte[]> ("records", id, rawRecords[i]);
   }
The first value takes 16 bytes and the second between 40-60 bytes.

Any idea why it takes 79 seconds to store 4939 records?


I used DBreeze_01_047_20130402_NET40_MONO. I thought this one was meant for the .NET 4 framework.


When I switched to DBreeze_01_047_20130402_NET35, everything went back to normal. However, why is it
so different with this version? I would rather use the .NET 4 version.
Apr 9, 2013 at 8:49 PM
Edited Apr 9, 2013 at 9:37 PM
Yup, I've tried compiling your latest version with #define NET40 set globally in project settings and target framework 4. When I use this library,
things are slow as hell; the above code produced: records=4939, Time=113227.5ms.
Same as before.
Coordinator
Apr 10, 2013 at 6:49 AM
Edited Apr 10, 2013 at 7:24 AM
You are confusing me with such a big speed difference between NET35 and NET40 - it needs further investigation.

Also, I must say that the speed of _NET40_MONO is probably closer to reality, because you are inserting random keys on a mechanical hard drive.
Maybe this article can help you understand GUID primary keys.

Speed of such an insert (2 MLN record inserts, keys growing in different manners) is 9 sec with NET40_MONO and 6 sec with NET35:
DBreezeEngine engine = null;

private void button1_Click(object sender, EventArgs e)
{
    if (engine == null)
    {
        engine = new DBreezeEngine(new DBreezeConfiguration
        {
            DBreezeDataFolderName = @"E:\temp\DBreezeTest\DBR2"
        });
    }

    using (var tran = engine.GetTransaction())
    {
        DBreeze.Diagnostic.SpeedStatistic.StartCounter("a");

        DateTime dt = new DateTime(1970, 1, 1);
        for (int i = 0; i < 1000000; i++)
        {
            tran.Insert<int, byte[]>("t1", i, new byte[] { 1, 2, 3 });
            dt = dt.AddSeconds(7);
            tran.Insert<DateTime, byte[]>("t2", dt, new byte[] { 1, 2, 3 });
        }

        tran.Commit();
    }

    DBreeze.Diagnostic.SpeedStatistic.PrintOut("a", true);
    Console.WriteLine("DONE");
}
Coordinator
Apr 10, 2013 at 7:17 AM
Edited Apr 10, 2013 at 1:08 PM
Also, to achieve performance in random bulk inserts with DBreeze, we must sort the keys in memory before supplying them to Insert. From DBreeze's point of view, in the case of a complex byte[] key (which is not a "standard" data type like int or DateTime), sorted means "sorted like strings".

If the key is a complex byte[] (non-standard data type):

In DBreeze.Utils we have byte[] extensions:

byte[] b;
string sb = b.ToBytesString("");

sb will contain the string representation of the byte[].
Put all your keys into a Dictionary<string, ...> and apply LINQ OrderBy to get the correct sequence for tran.Insert.
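
For example, a minimal sketch of that preparation (rawKeys, the inserted value and the table name are illustrative):

using System.Collections.Generic;
using System.Linq;
using DBreeze.Utils;

// Key the dictionary by the string form so OrderBy yields DBreeze's "sorted like strings" order.
Dictionary<string, byte[]> sortedKeys = new Dictionary<string, byte[]>();
foreach (byte[] b in rawKeys)                        // rawKeys: your complex byte[] keys
    sortedKeys.Add(b.ToBytesString(""), b);

using (var tran = engine.GetTransaction())
{
    foreach (var kvp in sortedKeys.OrderBy(r => r.Key))
        tran.Insert<byte[], byte[]>("t1", kvp.Value, new byte[] { 1, 2, 3 });
    tran.Commit();
}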

If the key is a standard type like DateTime:

Dictionary<DateTime, valtype> _d = ...;
foreach (var kvp in _d.OrderBy(r => r.Key))
{
    tran.Insert<DateTime, valtype>("t1", kvp.Key, kvp.Value);
}
tran.Commit();
Apr 10, 2013 at 9:04 PM
Edited Apr 10, 2013 at 10:56 PM
The problem is that my keys are GUIDs. And I did sort them as you said (I checked in the debugger that they were sorted in ascending order). But after I've inserted 10k into the database it slows down again. Maybe because of the entropy of the GUID keys?
Coordinator
Apr 11, 2013 at 4:33 AM
Please publish your code... if you do experiments...
I don't know the character of your inserts... how big the insert block quantity is... why it slows down after 10k and not after 12...

To reduce, or even get rid of, the entropy, try to create the GUIDs yourself.

The algorithm must be the following:
first 8 bytes - DateTime.UtcNow.To_8_bytes_array (typing from phone... sorry for syntax), the rest - a random byte array.
Concatenate all into one big byte array; datetime-to-byte-array is a function from DBreeze.Utils.
Uniqueness will be guaranteed... plus the keys will already be sorted by creation datetime
and there will be no entropy at all. Please publish your results and testing code.
Coordinator
Apr 11, 2013 at 8:28 AM
Edited Apr 11, 2013 at 8:28 AM
Materializing the previous concept.
using DBreeze;
using DBreeze.Utils;

DBreezeEngine engine = null;

private void InitEngine()
{
    if (engine == null)
    {
        engine = new DBreezeEngine(new DBreezeConfiguration()
        {
            DBreezeDataFolderName = @"D:\temp\DBreezeTest"
        });
    }
}

ushort guidUp = 0;
Random rnd = new Random();  // kept as a field so repeated calls don't reseed with the same time-based value

private byte[] GenerateGuid24()
{
    guidUp++;
    byte[] bt = new byte[24 - 8 - 2];
    rnd.NextBytes(bt);
    // 8 bytes of UTC time + 2-byte counter + 14 random bytes = 24-byte key, already sorted by creation time
    return DateTime.UtcNow.To_8_bytes_array().ConcatMany(guidUp.To_2_bytes_array_BigEndian(), bt);
}

private void Test4()
{
    InitEngine();
    byte[] key = null;
    byte[] val = new byte[] { 1, 2, 3 };

    for (int j = 0; j < 200; j++)
    {
        DBreeze.Diagnostic.SpeedStatistic.StartCounter("a");
        using (var tran = engine.GetTransaction())
        {
            Console.WriteLine(tran.Count("t1"));

            for (int i = 0; i < 1000; i++)
            {
                key = GenerateGuid24();
                tran.Insert<byte[], byte[]>("t1", key, val);
            }

            tran.Commit();
        }
        DBreeze.Diagnostic.SpeedStatistic.PrintOut("a", true);
    }
}
DBR 47 NET40-MONO
Execution Results

a: 1; Time: 322 ms; 880847 ticks
184000
a: 1; Time: 300 ms; 822029 ticks
185000
a: 1; Time: 255 ms; 697186 ticks
186000
a: 1; Time: 255 ms; 699202 ticks
187000
a: 1; Time: 247 ms; 675225 ticks
188000
a: 1; Time: 263 ms; 721204 ticks
....


So, every 1000-record block insert takes 250-300 ms... This is a "controllable" GUID...
Unfortunately, for GUIDs with "high entropy" (probably exactly your case)
a specific approach must be developed, and it must come from the business logic.
If you explain in detail what these GUIDs are, how they appear, and the character of their appearance, we can try to think about this.
Coordinator
Apr 11, 2013 at 8:40 AM
It's not implemented yet, but there is a theoretical possibility to configure DBreeze NOT to update trie nodes, but to always write them to the end of the file.
In this case the in-memory buffer will be switched on, we won't need rollback writes, and the "bare metal" will be touched rarely in the case of bulk insert/remove/update.
It will definitely make the file bigger, but we could use this approach only for pattern-selected entities.
The same concerns updates of the values.
Coordinator
Apr 11, 2013 at 8:53 AM
Also, switching OFF the previously described updates is interesting only when we have bulk data to insert and the speed matters to a watching human, for example :). If it's a crawler of URLs, it can store all data into a flat file and then, once every few hours, take all the data into memory, sort it, and put it into the DB... so approaches can differ.
Coordinator
Apr 14, 2013 at 11:17 AM
Edited Apr 14, 2013 at 1:11 PM
I made an experiment with RANDOMLY inserted entities.

So, the idea is to always add search-trie nodes at the end of the file and never overwrite them. This way we can get rid of rollback-file writes and of overwriting parts of the data file.
It should give us a huge speed boost at the cost of a big loss of HDD free space (which can be solved by compaction later).

This approach can be interesting if we need to quickly save big chunks of random keys in bulk.

Emulation of random keys:

This example demonstrates the current DBreeze version 47 (INDUSTRIAL settings: writing through the OS cache directly to the disk).
Inserting 500K random keys in 1K commit chunks.
Console.WriteLine("INSERT RANDOM");
Random rnd = new Random();
byte[] bt = new byte[20];
Dictionary<string, byte[]> _d = new Dictionary<string, byte[]>();
byte[] bt1 = null;
double ms = 0;

for (int i = 0; i < 500; i++)
{
    _d.Clear();

    // Build a chunk of 1000 random 20-byte keys, keyed by their string form for sorting.
    for (int j = 0; j < 1000; j++)
    {
        rnd.NextBytes(bt);
        bt1 = new byte[20];
        Buffer.BlockCopy(bt, 0, bt1, 0, 20);
        _d.Add(bt.ToBytesString(""), bt1);
    }

    DBreeze.Diagnostic.SpeedStatistic.StartCounter("a");
    using (var tran = engine.GetTransaction())
    {
        // Insert the chunk in sorted key order, then commit.
        foreach (var el in _d.OrderBy(r => r.Key))
        {
            tran.Insert<byte[], byte[]>("t1", el.Value, new byte[] { 1, 2 });
        }
        tran.Commit();
    }
    Console.WriteLine("ITER: " + i);
    ms += DBreeze.Diagnostic.SpeedStatistic.GetCounter("a").ElapsedMs;
    DBreeze.Diagnostic.SpeedStatistic.PrintOut("a", true);
}
HDD
Resulting average speed per 1000-key insert is 15 sec. File size is 25 MB. Total insert time = 2 hours.
SSD
Resulting average speed per 1000-key insert is 4 sec. File size is 25 MB. Total insert time = 33 min.


Changing DBreeze to write trie nodes to the end of the file - this is not integrated yet, I made it programmatically; if all is good, we will integrate a config option into the next version.

In the file LTrieGenerationNode.cs, in the WriteSelf procedure, we comment out the condition if (reservation > this.QuantityReservationSlots) and, one line before it, set bool OverWrite = true; to false.

Inserting 500K random keys in 1K commit chunks.
WE DO in-memory sorting of each chunk's keys before insert; if we DON'T, we get a file 4 times bigger.

HDD
Resulting average speed per 1000-key insert is 214 ms. File size is 226 MB (452 bytes/record). Total insert time = 1.5 min.

Compaction of the table (just reading with SelectForward and inserting into another table) finished in 9 seconds and created a file of 18 MB.


I await your comments about this feature.
Coordinator
Apr 14, 2013 at 4:55 PM
So all these thoughts are implemented in version 48 (read the docs, grab the sources and test) - if all is OK, I can publish it for the community.
Apr 14, 2013 at 7:56 PM
I've had very little time lately to dedicate to programming. I'll study what you wrote, test what I can, and let you know ASAP.
Coordinator
Apr 15, 2013 at 3:47 PM
I have released new version and updated documentation.
Apr 16, 2013 at 2:03 AM
Also, what do you think about another overload of:

public void Insert<TKey, TValue> (string tableName, TKey key, TValue value, out byte[] refToInsertedValue, out bool WasUpdated, bool overwrite);

where a duplicate key is not overwritten if it already exists?
Coordinator
Apr 16, 2013 at 7:00 AM
Due to the internal DBreeze save algorithm, it makes no sense (and is not possible).
The fact is that when you call Insert, it doesn't mean the data will be saved right now.
Maybe yes, maybe not - everything depends on your next instruction.
That's the reason why you cannot turn on and then turn off the overwrite setting for one table inside one transaction.
Apr 16, 2013 at 1:18 PM
Edited Apr 16, 2013 at 1:30 PM
Hmm, so for bulk inserts what is the best way to do it - select each key first, then insert only if it doesn't exist? Or does a more elegant method exist? I could think of only one: roll back if the insert indicates that the record was updated, using:
public void Insert<TKey, TValue>(string tableName, TKey key, TValue value, out byte[] refToInsertedValue, out bool WasUpdated)
Coordinator
Apr 16, 2013 at 1:43 PM
Edited Apr 16, 2013 at 1:44 PM
I didn't really understand your question.

out bool WasUpdated just indicates (gives you extra information) that a value for the inserted key existed before, so an update took place.

Concerning our previous discussions about speeding up updates and random-key inserts, use the following technique (a short sketch follows this list):
  • open a transaction
  • call tran.Technical_SetTable_OverwriteIsNotAllowed(tableName) for each table where you want to turn off the overwrite setting and speed up the process
  • make your batch inserts/removes/updates
  • call Commit
  • close the transaction
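
A minimal sketch of that sequence (the table name, key/value types and the batchKeys source are illustrative):

using (var tran = engine.GetTransaction())
{
    // Turn off overwrite for every table taking part in the batch to speed up the process.
    tran.Technical_SetTable_OverwriteIsNotAllowed("t1");

    for (int i = 0; i < batchKeys.Length; i++)
        tran.Insert<int, byte[]>("t1", batchKeys[i], new byte[] { 1, 2, 3 });

    tran.Commit();
}   // the using block closes the transaction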
Apr 16, 2013 at 1:54 PM
OK, I will test the speed.


About what I asked: what if I want to insert a bulk amount of records but don't want to overwrite if any key already exists?
Coordinator
Apr 16, 2013 at 2:02 PM
Edited Apr 16, 2013 at 2:05 PM
krome wrote:
About what I asked: what if I want to insert a bulk amount of records but don't want to overwrite if any key already exists?
For now there is no extra parameter in the insert statement to control such behaviour... but it's theoretically possible.

Today we do it this way:

Dictionary<int, int> valuesToInsert = new Dictionary...

using (var tran = DBEngine.GetTransaction())
{
    foreach (var kvp in valuesToInsert.OrderBy(r => r.Key))   // keys are sorted ascending
    {
        // NOTE: asReadVisibilityScope is used (it's like a second channel, which also speeds up the process)
        var row = tran.Select<int, int>("t1", kvp.Key, true);
        if (!row.Exists)
            tran.Insert<int, int>("t1", kvp.Key, kvp.Value);
    }
    tran.Commit();
}
Apr 18, 2013 at 12:55 AM
Sorry, can you explain what asReadVisibilityScope is? I don't quite understand what the documentation says about it.
Coordinator
Apr 18, 2013 at 11:12 AM
Think about it as a second instance of the search trie, one which doesn't interfere with the instance that must write to disk (insert).
Because both instances are served in parallel, they work faster. It's not possible to create a second instance within a transaction except by using this asReadVisibilityScope flag.
There can be a maximum of 2 instances serving one table inside one transaction (1 reading and 1 writing).

If you only write, one instance will always be used within one transaction.
If you only read, then only one instance will be used.
If you read and write, also one instance will be used (unless you set asReadVisibilityScope).

Something like this; in code it looks like the sketch below.
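
A minimal sketch of that read-and-write case (table name, types and key are illustrative):

using (var tran = engine.GetTransaction())
{
    // true = asReadVisibilityScope: the read goes through the second, parallel trie instance,
    // so it doesn't interfere with the writing instance used by Insert below.
    var row = tran.Select<int, int>("t1", 42, true);
    if (!row.Exists)
        tran.Insert<int, int>("t1", 42, 1);

    tran.Commit();
}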
May 25, 2013 at 3:48 PM
Could you look at this and tell me why it is slow?

Two nested tables: one contains records, the other contains GUIDs as keys and record-table IDs as values.
The problem is not removal. I use sequential GUIDs, same as the GUID approach you posted above, and they're not problematic.
The main slowdown is in this:
for (i = 0; i < sortedGuids.Count; ++i)
   {
    rawGuid = sortedGuids[i];
    row = dt.Guid.Select<byte[], ulong> (rawGuid, true);
    if (row.Exists)
    {
     id = row.Value;
     keys.Add (id);
This is where I select each GUID's value. The full method:
public void PurgeRecordsByGuid (Transaction transaction, VipAddress address, IEnumerable<Guid> guids)
  {
   int i;
   ulong id;
   ulong count;
   bool removed;
   byte[] rawGuid;
   string tableName;
   List<ulong> keys;
   List<byte[]> sortedGuids;
   Row<byte[], ulong> row;
   Stopwatch sw;
   DataTables dt;
   Dictionary<string, byte[]> dictionary;

   sw = new Stopwatch ();
   keys = new List<ulong> ();
   dictionary = new Dictionary<string, byte[]> (guids.Count ());
   tableName = FetchPhysicalTableName (PhysicalTable.DATA_TABLE, address);
   dt = new DataTables (transaction, tableName);
   dt.Data.Technical_SetTable_OverwriteIsNotAllowed ();
   dt.Guid.Technical_SetTable_OverwriteIsNotAllowed ();

   sw.Start ();
   foreach (Guid guid in guids)
   {
    rawGuid = guid.ToByteArray ();
    dictionary.Add (rawGuid.ToBytesString (), rawGuid);
   }
   sw.Stop ();
   LOG.DebugFormat ("Adding guids took {0}ms.", sw.ElapsedMilliseconds);

   sw.Restart ();
   sortedGuids = dictionary.OrderBy (kvp => kvp.Key).Select (kvp => kvp.Value).ToList<byte[]> ();
   sw.Stop ();
   LOG.DebugFormat ("Sorting guids took {0}ms.", sw.ElapsedMilliseconds);

   sw.Restart ();
   for (i = 0; i < sortedGuids.Count; ++i)
   {
    rawGuid = sortedGuids[i];
    row = dt.Guid.Select<byte[], ulong> (rawGuid, true);
    if (row.Exists)
    {
     id = row.Value;
     keys.Add (id);
     dt.Guid.RemoveKey<byte[]> (rawGuid);
    }
   }
   sw.Stop ();
   LOG.DebugFormat ("Removing guids took {0}ms.", sw.ElapsedMilliseconds);

   sw.Restart ();
   keys = keys.OrderBy (k => k).ToList ();
   sw.Stop ();
   LOG.DebugFormat ("Sorting record ids took {0}ms.", sw.ElapsedMilliseconds);

   sw.Restart ();
   for (i = 0, count = 0; i < keys.Count; ++i)
   {
    dt.Data.RemoveKey<ulong> (keys[i], out removed);
    count += (removed ? 1UL : 0UL);
   }
   sw.Stop ();
   LOG.DebugFormat ("Removing records took {0}ms.", sw.ElapsedMilliseconds);
   dt.UpdateRemovedStatistics (count);
  }
Coordinator
May 25, 2013 at 4:19 PM
So I made the following experiment:

In procedure testC14() I fill a nested table with 100K random GUIDs.
In procedure testC15() I read and remove all 100K.

Timing results:
testC14 = 1523 ms = 1.5 sec
testC15 = 6700 ms = 6.7 sec
File size = 4.8 MB

I find this good.

Here are the procedures:
private void testC14()
        {
            byte[] btGuid = null;

            Dictionary<string, byte[]> _d = new Dictionary<string, byte[]>();


            for (int i = 0; i < 100000; i++)
            {
                btGuid = Guid.NewGuid().ToByteArray();

                _d.Add(btGuid.ToBytesString(""), btGuid);
            }


            System.Diagnostics.Stopwatch sw = new System.Diagnostics.Stopwatch();
            sw.Start();

            using (var tran = engine.GetTransaction())
            {
                NestedTable nt = null;
                nt = tran.InsertTable<int>("t1", 1, 0);

                foreach (var de in _d.OrderBy(r => r.Key))
                {
                    nt.Insert<byte[], byte>(de.Value, 1);
                }


                tran.Commit();

                Console.WriteLine("INSERTED: " + nt.Count());
            }

            sw.Stop();
            Console.WriteLine("Consumed: {0}", sw.ElapsedMilliseconds);

        }


private void testC15()
        {
            Dictionary<string, byte[]> _d = new Dictionary<string, byte[]>();

            using (var tran = engine.GetTransaction())
            {
                NestedTable nt = null;
                nt = tran.SelectTable<int>("t1", 1, 0);
                foreach (var row in nt.SelectForward<byte[], byte>())
                {
                    _d.Add(row.Key.ToBytesString(""), row.Key);
                }
            }


            System.Diagnostics.Stopwatch sw = new System.Diagnostics.Stopwatch();
            sw.Start();

            using (var tran = engine.GetTransaction())
            {
             //   tran.Technical_SetTable_OverwriteIsNotAllowed("t1");

                NestedTable nt = null;
                nt = tran.InsertTable<int>("t1", 1, 0);
                nt.Technical_SetTable_OverwriteIsNotAllowed();

                foreach (var de in _d.OrderBy(r => r.Key))
                {
                    var row = nt.Select<byte[], byte>(de.Value, true);
                    if (row.Exists)
                    {
                        nt.RemoveKey<byte[]>(row.Key);
                    }
                }

                tran.Commit();


                Console.WriteLine("REMOVED: " + nt.Count());
            }

            sw.Stop();
            Console.WriteLine("Consumed: {0}", sw.ElapsedMilliseconds);
        }
Please check that these procedures do conceptually the same as yours.