• 0

Parallel Hell :(


Question

I will love the person to death who can give me a solution to this issue. I have been at this since 11 AM (it's now 9:11 PM my time) and have made little to no progress, despite scouring every web site I can find. I have a foreach loop that pulls from a client, studies the data, and then makes DB changes. This has worked fine for quite some time because the lists have always been very small, which leads me to the issue; the problem itself is very simple.

 

The problem is that the bigger the list gets, the longer the process takes. If the list contains 8,500 complex objects, it takes an hour to complete. So I thought to myself, let's use the Parallel class and cut that time down as much as I can. Thus, the torture began.

 

TL;DR:

I have a DataTable, which by nature is thread safe for reads but not for writes. So I thought, no big deal, I'll just lock the table rows when I need to add a new row and I'll be set. The problem is, the only way I can get it to work is if I put the entire body in one lock, which takes longer than the non-parallel version does, like so:

Parallel.ForEach(pGrpLst, kvp =>
{
    try
    {
        pGrpM pGrp = kvp.Value;
        string baseSku = kvp.Value.BaseSku;
        lock (dbStore.MyTbl.Rows.SyncRoot)
        {
            var productDR = (from DataRow dr in dbStore.MyTbl.Rows
                             where dr["product_sku"] != DBNull.Value && (string)dr["product_sku"] == baseSku
                             select dr).FirstOrDefault();
            if (productDR == null)
            {
                productDR = dbStore.MyTbl.NewRow();
                dbStore.MyTbl.Rows.Add(productDR);
                productDR["product_sku"] = baseSku;
                productDR["product_class_id"] = 1;
            }
            productDR["product_name"] = pGrp.ProductName;
            productDR[5] = "Test";
        }
    }
    catch (Exception ex)
    {
        System.Diagnostics.Trace.TraceError(ex.ToString());
    }
});

It cut off the rest of my post :( but it's got to be because the whole lock statement is dragging down the performance. Now if I use this method:

Parallel.ForEach(pGrpLst, kvp =>
{
    try
    {
        ProductGrp pGrp = kvp.Value;
        string baseSku = kvp.Value.BaseSku;

        var productDR = (from DataRow dr in dbStore.ProductsTbl.Rows
                         where dr["product_sku"] != DBNull.Value && (string)dr["product_sku"] == baseSku
                         select dr).FirstOrDefault();
        if (productDR == null)
        {
            lock (dbStore.ProductsTbl.Rows.SyncRoot)
            {
                productDR = dbStore.ProductsTbl.NewRow();
                newPrdDRLst.Add(productDR);
                productDR["product_sku"] = baseSku;
                productDR["product_class_id"] = 1;
            }
        }
        else
        {
            productDR["product_name"] = pGrp.ProductName;
            productDR[5] = "Test";
        }
    }
    catch (Exception ex)
    {
        System.Diagnostics.Trace.TraceError(ex.ToString());
    }
});

It will work in amazing time, about 0.455 seconds. But if I try to modify productDR it fails hard. I've tried many different lock placements and different objects to lock (a generic object, the data row's item array, the whole data table, etc.). Unless I put it all in one big lock body (which defeats the purpose), it fails. Surely there has to be some way to do this? I don't care what order anything gets processed or added in; I just want it done as efficiently as possible.


Recommended Posts

  • 0

I am not an expert at all, but:

- Have you tried one connection per thread?

- Have you looked at how efficient the generated queries are? Maybe calling a database function is faster?

- EDIT: By "fails hard", do you mean at the database level or at the code level? I don't think you provided the actual error.

- EDIT: Does your DB backend support "WHERE IN (...)" and mass inserts (instead of one by one)?


  • 0

The actual database calls usually only take a fraction of a second. The problem is processing the data to determine what to modify in the data table. This is what takes time, and if I could use the Parallel library efficiently it would destroy the waiting time. I am really close to a solution but just can't quite get the results I need, and I'm stuck in a rut for new ideas on what else to try.

 

The most efficient method (the second example I posted) will loop through all iterations three times over in less than 1 second and write the new records, but I can't change an existing row in the data table no matter what I do unless I write ALL of the logic inside one lock body. This is what I was trying to accomplish here:


Parallel.ForEach(pGrpLst, kvp =>
{
    try
    {
        ProductGrp pGrp = kvp.Value;
        string baseSku = kvp.Value.BaseSku;

        var productDR = (from DataRow dr in dbStore.ProductsTbl.Rows
                         where dr["product_sku"] != DBNull.Value && (string)dr["product_sku"] == baseSku
                         select dr).FirstOrDefault();
        if (productDR == null)
        {
            lock (dbStore.ProductsTbl.Rows.SyncRoot)
            {
                productDR = dbStore.ProductsTbl.NewRow();
                newPrdDRLst.Add(productDR);
                productDR["product_sku"] = baseSku;
                productDR["product_class_id"] = 1;
            }
        }
        else
        {
            lock (dbStore.PrdToCatTbl.Rows.SyncRoot)
            {
                productDR["product_name"] = pGrp.ProductName;
                productDR[5] = "Test";
            }
        }
    }
    catch (Exception ex)
    {
        System.Diagnostics.Trace.TraceError(ex.ToString());
    }
});

  • 0

By "processing the data" you mean INSERT and UPDATE statements?

From my experience, inserting and updating one row at a time is slow.

EDIT: I would still try creating a few connections so that synchronization is handled by the database and not your code... granted, that is via DB Connection/Transaction/Command/Reader, which may work differently than this abstraction...


  • 0

Well, it adds the row to the table or updates the row, but the actual statements all get issued at once via an adapter.

EDIT: Is there a better way to do it? I'd be more than happy to hear other options. I'd rather not trash the entire use of adapters and tables if I can help it, as I'd have to rewrite three thousand lines of code, but if it makes sense I'll do it.


  • 0

I just don't see this as a Parallel.For issue... I would write the queries directly for the performance-critical parts...

EDIT: A fellow Neowinian can correct me here, but this doesn't look like it works in batches (i.e., select many, update many, insert many - three DB commands for multiple records).


  • 0

I hope I am reading this correctly (I am a Java guy myself), but it looks like you are checking for a row in the DB; if it exists you do an update, otherwise you make a new row.

 

If that is the case, check that there is an index on product_sku. If not, you are doing a table scan on each iteration. Having the index will speed up each iteration without touching the code.
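If the lookup target is a real database table, a plain index on product_sku does it. If it is actually the in-memory DataTable (as later posts suggest), the equivalent is a primary key on that column, so Rows.Find does a keyed lookup instead of the linear scan the LINQ query performs. A minimal sketch, assuming product_sku is unique and non-null as stated later in the thread:

// One-time setup: a primary key gives the DataTable an internal index.
dbStore.ProductsTbl.PrimaryKey = new DataColumn[]
{
    dbStore.ProductsTbl.Columns["product_sku"]
};

// Inside the loop: keyed lookup; returns null if no matching row exists.
DataRow productDR = dbStore.ProductsTbl.Rows.Find(baseSku);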

 

As for the code, it looks like you could have the select fetch all the items you want at once and then iterate over that.

 

One thing to be careful of in your optimization: if you get two processes operating over the same data at the same time, you can run into a situation where both threads look at the DB and see no data. Both will try to add a new row, but only the first will succeed; the second will fail because what it is trying to insert is already there.


  • 0

Well, remember that the DB changes are a side effect of the actual process. The main overhead is looping over thousands of product group objects in a dictionary. The loop body checks the dataset for a record matching a SKU; if not found, it adds it, and if found, it updates it. No other thread will access that row because the SKU is unique. So each iteration will only look for that one row and add or update it. Again, this spawns other independent logic, but one thing at a time.


  • 0

The parallel issue is that it works fine with no parallelism, it just takes 45 minutes when I have a loop of 8,500 objects. If I add items in parallel it takes less than 1 second to process 2,500 objects, but I can't update existing rows with the fast method without all sorts of exceptions and horrible havoc taking place.

Edit: sorry for the wall of text. On a cell phone and it's hard to type with.


  • 0

Oh boy, all sorts of potential problems here, but I'm not going to start making guesses without knowing a lot more. Let's start with the basic stuff: what are the exception messages you're receiving, and do you have a more in-depth description of "horrible havoc"?


  • 0

IndexOutOfRangeException each time it tries to modify productDR, whether I use the column index (e.g. productDR[3] = "Test") or the column name (productDR["product_name"] = "Test").

This is the implementation I'd like (below): create a new row if one does not exist, and modify it if it does.

[screenshot: the desired create-or-update implementation]


  • 0

Let's try to tackle this from a different direction. Basically, I'm just trying to split the workload of processing each object in the dictionary and making database updates across every core. Anyone have better ideas on how they would go about this? Maybe I should just use regular Tasks for this job. Parallel just seemed like a perfect fit since it has ForEach built straight in to break up the work itself.


  • 0


Part of the problem, I believe, is that DataTable is not thread safe for writes (read more here: http://stackoverflow.com/questions/9297249/is-parallel-foreachdatatable-asenumerable-thread-safe).

What I would suggest is that you add your changes into two new collections, new and updated - then when the parallel loop is complete, go through the data and insert/update in the normal fashion, as in the sketch below.
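A minimal sketch of that two-collection idea, reusing the names from the earlier snippets (ConcurrentBag lives in System.Collections.Concurrent; this assumes SKUs in pGrpLst are unique, as stated above):

var toInsert = new ConcurrentBag<string>();                            // SKUs needing new rows
var toUpdate = new ConcurrentBag<KeyValuePair<DataRow, ProductGrp>>(); // rows to change + new values

Parallel.ForEach(pGrpLst, kvp =>
{
    string baseSku = kvp.Value.BaseSku;
    var productDR = (from DataRow dr in dbStore.ProductsTbl.Rows
                     where dr["product_sku"] != DBNull.Value && (string)dr["product_sku"] == baseSku
                     select dr).FirstOrDefault();

    if (productDR == null)
        toInsert.Add(baseSku); // record the decision; don't touch the table yet
    else
        toUpdate.Add(new KeyValuePair<DataRow, ProductGrp>(productDR, kvp.Value));
});

// Single-threaded phase: every DataTable write happens here.
foreach (string sku in toInsert)
{
    DataRow row = dbStore.ProductsTbl.NewRow();
    row["product_sku"] = sku;
    row["product_class_id"] = 1;
    dbStore.ProductsTbl.Rows.Add(row);
}
foreach (var pair in toUpdate)
{
    pair.Key["product_name"] = pair.Value.ProductName;
    pair.Key[5] = "Test";
}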


  • 0


This would be my suggestion as well. Any time you try to modify the thing you're iterating over, weird things tend to happen.


  • 0

Shouldn't you be locking the table object itself, not just the rows?

if (productDR == null)
{
    lock (dbStore.ProductsTbl)
    {
        // ...
    }
    // ...
}
Alternatively, you could store some kind of reference to the data row in your dictionary object and keep the two synchronised; something like the sketch below.
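For instance (a hypothetical sketch; CachedRow is an invented property on the dictionary's value type):

// One-time pass, single-threaded: remember each product's row (or null).
foreach (var kvp in pGrpLst)
{
    ProductGrp pGrp = kvp.Value;
    pGrp.CachedRow = (from DataRow dr in dbStore.ProductsTbl.Rows
                      where dr["product_sku"] != DBNull.Value && (string)dr["product_sku"] == pGrp.BaseSku
                      select dr).FirstOrDefault();
}

// Later passes test pGrp.CachedRow == null instead of re-querying the table,
// so only the actual row creation still needs a lock.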

  • 0

I assume that when they say the DataTable isn't thread safe for writes, it means you can't have one thread writing while any other thread reads or writes at the same time.

 

So the assumption in your code is wrong, because you only lock on writes: you allow one thread to read from the table while another is writing.

 

You need to surround every part of your code that reads or writes the table with a shared ReaderWriterLockSlim. The behavior of this lock is that several threads can acquire it for reading, but only one thread can acquire it, strictly exclusively, for writing. So while one thread is writing, no other thread can read or write. Your code would look something like:

var lock_ = new ReaderWriterLockSlim();

Parallel.ForEach(pGrpLst, kvp =>
{
    lock_.EnterReadLock();
    try { /* read from the table */ }
    finally { lock_.ExitReadLock(); }

    if (needToWrite) // i.e. the row must be added or changed
    {
        lock_.EnterWriteLock();
        try { /* write to the table */ }
        finally { lock_.ExitWriteLock(); }
    }
});

Note that Select modifies the DataTable's internal state, so it has to go inside the write lock as well.

 

So now you get as much parallelism as the inherent thread safety of the DataTable type allows. I strongly suspect you're IO-bound, given that there's apparently no CPU-intensive processing going on, so it's likely you actually have little to gain from parallelization.

 

If memory allows, just cache all the rows you might need in memory, perform all the looping and branching logic on in-memory objects, and then commit all the changes to the DB in one go. Roughly:
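(A sketch only: Process stands in for the OP's per-object logic and returns the values to store; the other names are reused from the earlier snippets.)

// 1. Cache: build a SKU -> row lookup once, on a single thread.
var bySku = new Dictionary<string, DataRow>();
foreach (DataRow dr in dbStore.ProductsTbl.Rows)
    if (dr["product_sku"] != DBNull.Value)
        bySku[(string)dr["product_sku"]] = dr;

// 2. Process: the parallel loop touches only plain objects, never the table.
var results = new ConcurrentBag<ProductGrp>();
Parallel.ForEach(pGrpLst, kvp => results.Add(Process(kvp.Value)));

// 3. Commit: apply every change sequentially, then push to the DB in one go.
foreach (ProductGrp pGrp in results)
{
    DataRow row;
    if (!bySku.TryGetValue(pGrp.BaseSku, out row))
    {
        row = dbStore.ProductsTbl.NewRow();
        row["product_sku"] = pGrp.BaseSku;
        row["product_class_id"] = 1;
        dbStore.ProductsTbl.Rows.Add(row);
        bySku[pGrp.BaseSku] = row;
    }
    row["product_name"] = pGrp.ProductName;
}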


  • 0

The sad thing is, you could write a sproc with the logic in it to batch update as required, call that once from your site, and a few seconds later all the work is done.

I'm only showing you the very basics of the function. There is so, so, sooo much more that goes on with this. But I figured if I can't get the basic parallel part to work, the rest is just trash anyway, and I can't expect (no matter how kind you people are here :) ) people on the web to peel through 2,000 lines or so of code to see how it all ties together. But I do like the idea of two separate lists. I'll give that a shot. Thanks :)

 

EDIT: And correct me if I'm wrong here, but MySQL still doesn't support optional parameters, so I really don't know how I'd avoid updating fields to NULL that never got sent from the client. I'd have to write a sproc for each combination of updates, wouldn't I? I don't know how I'd make the sproc react to the data I'm presented with (and I'll admit I'm no expert on sprocs, I just have an understanding of them).

 

EDIT 2: The only issue I found when thinking about the two-list format is that I risk doubling the work anyway: if, in theory, all of those objects are updates, I'm just adding data rows to a list in parallel and then assigning the update values on a single thread. Hm... I wish MySQL supported optional parameters so I could look further into the sproc idea. I use sprocs for certain things, just not at that scale, due to that limitation. I hear there are workarounds, but they're not that great.
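For what it's worth, one workaround that needs no optional procedure parameters is to build the UPDATE text dynamically so only the fields the client actually sent get included. A rough sketch against the MySql.Data connector (table, column, and property names are illustrative; conn is an open MySqlConnection):

// Only columns whose incoming value is non-null make it into the statement,
// so fields the client never sent are left untouched.
var sets = new List<string>();
var cmd = new MySqlCommand { Connection = conn };

if (pGrp.ProductName != null)
{
    sets.Add("product_name = @name");
    cmd.Parameters.AddWithValue("@name", pGrp.ProductName);
}
if (pGrp.ProductClassId != null) // hypothetical nullable property
{
    sets.Add("product_class_id = @classId");
    cmd.Parameters.AddWithValue("@classId", pGrp.ProductClassId);
}

if (sets.Count > 0)
{
    cmd.CommandText = "UPDATE products SET " + string.Join(", ", sets) +
                      " WHERE product_sku = @sku";
    cmd.Parameters.AddWithValue("@sku", pGrp.BaseSku);
    cmd.ExecuteNonQuery();
}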


  • 0

Really, the first thing to do is profile your original method to determine where the bottlenecks are. If you're IO-bound, you'll gain little from parallelization except potential bugs. If you're CPU-bound, then it's just a matter of locking only around the code that actually accesses the DB and keeping the processing separate so it can run in parallel (outside the locks).


  • 0

So here's the million-dollar question: how does MySqlDataAdapter avoid updating every parameter when using stored procedures, given that optional and default parameter values aren't supported by MySQL as of this writing? If I don't update a field, it doesn't get updated with a null value. This just hit me this morning. The adapter does support using my stored procedures. I've been searching the web for ideas, but all of them keep pointing to SQL solutions ><


  • 0

The only thing I can think of is that it sends all of the data from the rows regardless of whether or not they've been updated, which seems like a waste of extra data being pushed across the wire.


  • 0

Why don't you just eliminate the select query altogether? Store a flag or reference in your dictionary, then do an insert or update depending on whether the product is null.

Ideally, you should refactor this code and separate the business logic from the data access logic. The two seem quite muddled here.


  • 0


I owe you an apology; I did not see this post yesterday for some reason. The query references an in-memory dataset/DataTable. That is about as efficient as it gets for fast lookups in a loop, isn't it?

 

I can't really store a flag that I can think of, because the client doesn't know the database even exists. I can't change that; I can only control the data coming in. The client basically sends update objects to store in the database. Quite simple, really; however, there is a LOT of information and a LOT of objects that get sent that I have to loop through. Then I test to see what is null (because if it's null, it didn't get serialized, so that object property is ignored); whatever isn't null gets updated in the database. I'll see if your implementation, Andre, works any better.

 

Side note: any time the data access is not in the lock body, it throws an error whenever I try to read the value or change it (the change part makes sense, since that is a write operation, but the read part feels odd). I would think locking the table would let me read the product data row I just queried... the same thread that performed the query should be holding the lock o.O


  • 0

Since Select counts as a write operation, I'm not sure you can actually parallelize this at all. Then again, to solve your performance issue you first need to figure out where your bottleneck is. If DB access is dominant, parallelism won't help you much.

 

If it's the actual processing of the data, and the DataTable isn't thread-safe, then perhaps the easiest way is to have one thread perform all the DB access (ensuring it is sequential) and multiple threads do the processing without ever touching the DB. You could basically have two blocking queues (see BlockingCollection, ConcurrentQueue): one for work items, one for results. One thread is responsible for reading and writing to the DB; it produces work items (in-memory copies of DB data) and consumes results (writing them back). Other threads (ideally just Tasks running on the thread pool) consume work items and produce results.

 

Glorious drawing to illustrate:

[diagram: worker tasks pulling from a work-item queue and pushing to a results queue, with a single DB thread producing the work and consuming the results]
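In code, the skeleton might look like this (WorkItem, Result, Compute, CopyRowData, and ApplyToTable are placeholders for the OP's types and logic; the queues are left unbounded so the DB thread can finish loading work before it starts draining results):

var work    = new BlockingCollection<WorkItem>();
var results = new BlockingCollection<Result>();

// Worker tasks: pure processing, no DataTable access at all.
Task[] workers = Enumerable.Range(0, Environment.ProcessorCount)
    .Select(_ => Task.Factory.StartNew(() =>
    {
        foreach (WorkItem item in work.GetConsumingEnumerable())
            results.Add(Compute(item));
    }))
    .ToArray();

// DB thread, part 1: read the table and produce work items.
foreach (var kvp in pGrpLst)
    work.Add(CopyRowData(kvp.Value));
work.CompleteAdding();

// When every worker finishes, no more results can arrive.
Task.Factory.ContinueWhenAll(workers, _ => results.CompleteAdding());

// DB thread, part 2: consume results and do every table write here.
foreach (Result r in results.GetConsumingEnumerable())
    ApplyToTable(r);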

 

Still, don't go implementing any of that before you've identified that it's actually the processing that's your bottleneck.

  • Like 2

  • 0


Glorious drawing - I can see you dabble in design as well as advanced programming.

 

Joking aside, this seems like a pretty good way to go IF this is indeed the bottleneck of the whole operation.


This topic is now closed to further replies.