Table consolidation after Bulk Insert.

Feb 27, 2015 at 9:52 PM
Hi.

I am in the process of integrating DBreeze into my project. I need to insert from 50K to 1M records into the database. Sometimes those records are updates, and with updates DBreeze performance suffers badly.
So I used the Technical_SetTable_OverwriteIsNotAllowed option, but now the database grows with every update. Is there an option to consolidate the database, remove the stale keys, and shrink its size?

Thank you.
Mar 1, 2015 at 12:29 PM
Hi,

Database shrinking, in DBreeze terms, always means copying all key/values from table1 to table2, removing table1 and renaming table2 to table1.

So you can let the system work with "Technical_SetTable_OverwriteIsNotAllowed" for a period of time to get very fast updates.

Then, let's say once per month/week/year/day, just copy your "overgrown" (due to a very intensive quantity of data and operations) table into a new table.
The new table will be much smaller in size, due to the new lexicographically-controlled key data allocation.
The copying process should be very fast - a few seconds for 1 million records.

You may have a question about table access synchronization while the table is being renamed, but this process can be planned in advance in your code for hybrid data flows.
Marked as answer by ErikSimonic on 3/2/2015 at 10:06 AM
Mar 1, 2015 at 12:39 PM
In our own production we define 3 types of data flows.

Automatic.
M2M data flows, which only insert data. We never update tables containing such data.

Manual.
When users insert and update information. As a rule, the quantity of such data is very small compared with Automatic.
We always use "Technical_SetTable_OverwriteIsNotAllowed" and never repack such tables - we have plenty of space on our HDDs for such a "small" amount of data.

Analytical.
E.g. Automatic data flows are grouped into specific analytical views, and such views can be updated.
We think in advance about what to do with such data in terms of time/size.
We mostly use "Technical_SetTable_OverwriteIsNotAllowed" and sometimes repack such data (maybe once per year).
Data can be stored in different time-interval-based tables: January2014, February2014 etc. Separate tables can be easily deleted. DBreeze's "table as a physical file" structure makes this type of shrinking native.
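For illustration, a minimal sketch of such a monthly layout (the names are only examples; "engine" is an existing DBreezeEngine instance):

    // Minimal sketch of the time-interval-based layout (illustrative only).
    string tableName = "Readings" + DateTime.UtcNow.ToString("MMMMyyyy");   // e.g. "ReadingsApril2016"

    using (var tran = engine.GetTransaction())
    {
        // Each record lands in the table of its own month.
        tran.Insert<long, byte[]>(tableName, 1, new byte[] { 42 });
        tran.Commit();
    }

    // Shrinking is then just dropping an obsolete month as a whole:
    engine.Scheme.DeleteTable("ReadingsJanuary2014");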
Marked as answer by ErikSimonic on 3/2/2015 at 10:06 AM
Mar 2, 2015 at 6:06 PM
Hi, thanks for the reply.
I assumed this was the case; when I tested the DB it was the only logical workflow to use.

Thanks for the explanation.
Apr 3, 2016 at 11:55 AM
hhblaze wrote:
In our own production we define 3 types of data flows: Automatic, Manual and Analytical. [...] We mostly use "Technical_SetTable_OverwriteIsNotAllowed" and sometimes repack such data (maybe once per year). Data can be stored in different time-interval-based tables: January2014, February2014 etc. Separate tables can be easily deleted. DBreeze's "table as a physical file" structure makes this type of shrinking native.
Hello,
I am having exactly the problem described by ErikSimonic.
Updates are extremely slow unless T.Technical_SetTable_OverwriteIsNotAllowed("t1") is explicitly set.
I would like to know the exact code required to 'truncate' the excess data so that only the last updated values remain and the file size is reduced.
You mention copying to another table etc. Can you provide sample code? I just need to truncate and keep only the recent values after exceeding a certain time and/or file size.

Also, is it possible (since the data is somewhere in the file) to get a log of the previous values of a certain key?

From what I understand, Key1/Value9 is appended to the end of the file and treated as the active record, even if Key1/Value8 already exists in the file.
Is there a way to get all previous values (Value7, Value6, etc.) before 'truncating' the file?

Appreciate your help
Apr 3, 2016 at 4:21 PM
Hi, can you describe how often you update the table and how many records are involved? Also, which platform and what type of application is it?
There is no way to keep only the last updated values in a table while you use the speedy update. That's the price.
But once per week/month/year you can copy from t1 to t2 - just SelectForward from t1 and Insert into t2, then rename t2 to t1.
Apr 4, 2016 at 8:43 AM
Concerning other questions.
A log is also not available automatically - we don't need it, and we don't always use "Technical_SetTable_OverwriteIsNotAllowed".


This is the part I didn't really understand:
From what I understand, Key1/Value9 is appended to the end of the file and treated as the active record, even if Key1/Value8 already exists in the file.
Is there a way to get all previous values (Value7, Value6, etc.) before 'truncating' the file?

If you use "Technical_SetTable_OverwriteIsNotAllowed", then everything you insert/update will be appended to the end of the file. If you don't use "Technical_SetTable_OverwriteIsNotAllowed", then there is a probability of an in-place overwrite.
Apr 4, 2016 at 9:20 AM

Good morning,


In our .NET application, we maintain a Dictionary(Of Long, Object) in memory. Approx 100k records, but it can easily go to around 5m in a larger deployment.

This dictionary is populated when the application starts and is heavily read/written when live.


We need a mechanism to persist the dictionary and the changes (delta)

As of now we are using a traditional SQL table in a relational database, which has certain drawbacks. One of them is that every write has to check via SQL whether the row already exists (update) or not (insert).

We thought of using DBreeze with a table T1 of (long, string) that mirrors the dictionary.


I noticed that writing 100k long/string key-values takes about 20 seconds (on a small test PC).

With OverwriteIsNotAllowed set, the speed improves: under 0.9s for 100k, but still about 8 seconds for a million records.

That is acceptable as the production server is much faster.


On the other hand, I am worried about the file size increasing too much as rows keep getting appended to the database file. It's our application that writes to the DBreeze file; there is no human intervention, so management/compaction of the file has to be done within our application at startup or with a task scheduler.


From what I understand, this is what happens on a write (simplified):


    key   value         Active
    101   abracadabra   0
    101   abcd          0
    101   hello world   1

Read 101 = "hello world"
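In code terms I mean something like this (just a sketch, assuming table "t1" and an existing engine instance):

    // Repeated inserts of the same key append new records,
    // but a read only ever returns the latest one.
    using (var tran = engine.GetTransaction())
    {
        tran.Technical_SetTable_OverwriteIsNotAllowed("t1");
        tran.Insert<long, string>("t1", 101, "abracadabra");
        tran.Insert<long, string>("t1", 101, "abcd");
        tran.Insert<long, string>("t1", 101, "hello world");
        tran.Commit();
    }

    using (var tran = engine.GetTransaction())
    {
        var row = tran.Select<long, string>("t1", 101);
        Console.WriteLine(row.Exists ? row.Value : "<not found>");   // prints "hello world"
    }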


My questions are:

a) How can I remove the inactive records to save space? Is there a single command on some engine or transaction object? I can lock the file until the operation is complete. Multiple users are not accessing the DBreeze file; it's only our application DLL calling the DBreeze DLL.

b) While deleting the first 2 records (in the example above), can I instead cut/paste them to a 'log' data file for archival? Only the inactive records.

That way the main file size is reduced, while I have the changes logged in a separate archive, on tape perhaps.


Thanks,

Sanil







Apr 4, 2016 at 12:26 PM
Edited Apr 4, 2016 at 12:31 PM
So, as I understand it:
You have an in-memory dictionary under heavy read/write, and you don't want to lose its state - that's why you need persistence.

Everything depends on the real IO speed and the importance of your data.

If you don't want to lose any change to the dictionary, I see the following scenario; DBreeze and the filesystem must both be involved.
In one "transaction" (it can also be a DBreeze transaction, see the documentation), you change the in-memory dictionary and append a log of that change (serialized key/value plus the size of the KV) to the end of an always-open file.
Once per 1/2/3 hour(s) you start a persisting procedure (also inside a transaction) which copies the dictionary (complete or only the changed keys) into the database with Technical_SetTable_OverwriteIsNotAllowed and then removes/clears the log file.

This keeps the DBreeze table from growing very fast, because persisting is quite rare. You restore the state of the dictionary not from the log file but from the DBreeze table.

But when your system starts (maybe after a failure occurred), the log file content must first be replayed into the DBreeze table (plus the in-memory dictionary) and then cleared out.

Each log file write must end with Flush().
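A rough sketch of that scheme (file name, table name and the serialization format are only placeholders, not a prescription):

    // Always-open append-only change log (requires System.IO, System.Collections.Generic, DBreeze).
    FileStream log = new FileStream("dict.log", FileMode.OpenOrCreate, FileAccess.Write);
    log.Seek(0, SeekOrigin.End);

    void OnDictionaryChange(long key, string value)
    {
        // 1) apply the change to the in-memory dictionary (not shown)
        // 2) append the serialized change to the log file
        byte[] k = BitConverter.GetBytes(key);
        byte[] v = System.Text.Encoding.UTF8.GetBytes(value ?? "");
        log.Write(BitConverter.GetBytes(v.Length), 0, 4);
        log.Write(k, 0, k.Length);
        log.Write(v, 0, v.Length);
        log.Flush();   // every log write ends with Flush()
    }

    void PersistSnapshot(DBreezeEngine engine, Dictionary<long, string> dict)
    {
        // Runs once per 1/2/3 hour(s): copy the dictionary into DBreeze,
        // then clear the log file.
        using (var tran = engine.GetTransaction())
        {
            tran.Technical_SetTable_OverwriteIsNotAllowed("t1");
            foreach (var kv in dict)
                tran.Insert<long, string>("t1", kv.Key, kv.Value);
            tran.Commit();
        }
        log.SetLength(0);   // log content is now covered by the DBreeze table
    }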
Apr 4, 2016 at 2:03 PM

Exactly as you described.


Right now we serialize the values of the dictionary and then SqlBulkCopy into the relational table (deleting all rows first in the same transaction).

It is easier than finding which keys need to be inserted or changed. The in-memory dictionary also has new KV pairs created at runtime, so it's hard to track those as well.

At application launch, the dictionary is populated from the persisted table. Later only deltas need to be updated.


Instead of bulk-copying to the relational table, I added those keys from the dictionary to DBreeze. With that tuning option on, the speed was satisfactory.

The question in my previous post was about the delta and how to compact the table in DBreeze.



Apr 4, 2016 at 2:53 PM
In DBreeze, the best way to compact data is copying from "t1" to "t2", then renaming "t2" to "t1":
            // Copy all key/values from the overgrown "t1" into a fresh table "t2".
            using (var tran = engine.GetTransaction())
            {
                // Reserve both tables for writing inside this transaction.
                tran.SynchronizeTables("t1", "t2");

                // Only the currently active key/values are enumerated,
                // so the copy contains no stale record versions.
                foreach (var row in tran.SelectForward<byte[], byte[]>("t1"))
                {
                    tran.Insert<byte[], byte[]>("t2", row.Key, row.Value);
                }

                tran.Commit();
            }

            // Drop the overgrown table and let the compacted copy take its name.
            engine.Scheme.DeleteTable("t1");
            engine.Scheme.RenameTable("t2", "t1");
Apr 5, 2016 at 9:13 AM

Yes I saw this code already in the post I commented on.

It would be nice if you encapsulated it within DBreeze in the next version.

Something like engine.CompactTable("t1")

I think the Access engine has a similar method for .mdb files.
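For now I suppose we could wrap it on our side, something like this (just a sketch; CompactTable is not an existing DBreeze API, it only wraps the copy/rename procedure shown above):

    // Hypothetical helper - NOT part of DBreeze - wrapping the copy/rename compaction.
    public static class DBreezeCompaction
    {
        public static void CompactTable(this DBreezeEngine engine, string tableName)
        {
            string tmp = tableName + "_compacting";   // temporary table name (made up)

            using (var tran = engine.GetTransaction())
            {
                tran.SynchronizeTables(tableName, tmp);

                // Copy only the live key/values into the fresh table.
                foreach (var row in tran.SelectForward<byte[], byte[]>(tableName))
                    tran.Insert<byte[], byte[]>(tmp, row.Key, row.Value);

                tran.Commit();
            }

            engine.Scheme.DeleteTable(tableName);
            engine.Scheme.RenameTable(tmp, tableName);
        }
    }

    // usage: engine.CompactTable("t1");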





Apr 9, 2016 at 10:19 PM
Heterogeneous table content makes an automatic compaction procedure impossible.
I have already mentioned one of the most popular structures we use for storing entities: in a root table<byte[],byte[]>, the first row stores a monotonically growing entity ID (long), the second row stores a nested table with the entity itself, e.g. NestedTable<long,T>, and the third row can store a secondary index for that entity.