I have an app that needs to upload a lot of files into my DB, where each file will typically be perhaps 400K but can max out at 16MB (though that would be very rare). Can anyone comment on whether it will heavily tax my system if I wrap each file's upload in its own transaction. There will typically be a few dozen files being simultaneously uploaded by my users or less (usually at the end of each month), but it can theoretically be several hundred or even into the thousands (though highly improbable in either case). Can anyone offer any insight into whether so many transactions will cause a problem. I'm using SQL Server 2008. Thank you.
What is the point in wrapping each file upload into transaction? Each INSERT is atomic transaction already, lasting the length of insert.
If you always had 3 related files coming and you needed to upload them as unit, and ensure that all are loaded or all fail, then explicit wrapper transaction is a valid approach.
If you have appropriate size SQL Server, this should be no problem. if most files are small (under 1MB), then table column of (max) type will do. Else, check out FILESTREAM.
> I have an app that needs to upload a lot of files into my DB, where each > file will typically be perhaps 400K but can max out at 16MB (though that > would be very rare). Can anyone comment on whether it will heavily tax my > system if I wrap each file's upload in its own transaction. There will > typically be a few dozen files being simultaneously uploaded by my users > or less (usually at the end of each month), but it can theoretically be > several hundred or even into the thousands (though highly improbable in > either case). Can anyone offer any insight into whether so many > transactions will cause a problem. I'm using SQL Server 2008. Thank you.
> What is the point in wrapping each file upload into transaction? Each > INSERT is atomic transaction already, lasting the length of insert.
> If you always had 3 related files coming and you needed to upload them as > unit, and ensure that all are loaded or all fail, then explicit wrapper > transaction is a valid approach.
> If you have appropriate size SQL Server, this should be no problem. if > most files are small (under 1MB), then table column of (max) type will do. > Else, check out FILESTREAM.
Thanks for the feedback. Each file actually consists of thousands of records, each record targetting an appropriate table. The upload of all records is an all-or-nothing deal. I want to make sure the system's not going to explode if one day several hundred people try to upload their file at the same time.
>> What is the point in wrapping each file upload into transaction? Each >> INSERT is atomic transaction already, lasting the length of insert.
>> If you always had 3 related files coming and you needed to upload them as >> unit, and ensure that all are loaded or all fail, then explicit wrapper >> transaction is a valid approach.
>> If you have appropriate size SQL Server, this should be no problem. if >> most files are small (under 1MB), then table column of (max) type will >> do. >> Else, check out FILESTREAM.
> Thanks for the feedback. Each file actually consists of thousands of > records, each record targetting an appropriate table. The upload of all > records is an all-or-nothing deal. I want to make sure the system's not > going to explode if one day several hundred people try to upload their > file at the same time.
Well how much something like this will affect the performance of the server depends on many factors and is difficult to say without knowing much more about it. But assuming you have properly configured hardware and place the log files on their own Raid 1 or raid 10 it should not be a real problem. But if you undersize the hardware or don't configure it properly there can definitely be bottlenecks.
> I have an app that needs to upload a lot of files into my DB, where each > file will typically be perhaps 400K but can max out at 16MB (though that > would be very rare). Can anyone comment on whether it will heavily tax my > system if I wrap each file's upload in its own transaction. There will > typically be a few dozen files being simultaneously uploaded by my users > or less (usually at the end of each month), but it can theoretically be > several hundred or even into the thousands (though highly improbable in > either case). Can anyone offer any insight into whether so many > transactions will cause a problem. I'm using SQL Server 2008. Thank you.
> I have an app that needs to upload a lot of files into my DB, where each > file will typically be perhaps 400K but can max out at 16MB (though that > would be very rare). Can anyone comment on whether it will heavily tax my > system if I wrap each file's upload in its own transaction. There will > typically be a few dozen files being simultaneously uploaded by my users > or less (usually at the end of each month), but it can theoretically be > several hundred or even into the thousands (though highly improbable in > either case). Can anyone offer any insight into whether so many > transactions will cause a problem. I'm using SQL Server 2008. Thank you.
> You seem to talk about BCP/BULK INSERT type of load. Look at these > commands in details. > Check ROWS_PER_BATCH value; it might help you.
Thanks. I'm actually using the "SqlBulkCopy" class (assuming you're familiar with .NET) and this presumably wraps BULK INSERT (still researching things). I also took a quick look at ROWS_PER_BATCH and this may prove useful (thanks for pointing this out). If possible however, I'd like to insert all data in one transaction so it's still unclear to me whether it's going to be a potential problem.
> Well how much something like this will affect the performance of the > server depends on many factors and is difficult to say without knowing > much more about it.
Thanks for you help and I agree. It's hard to appreciate when you're not really familiar with the details. In this case however it's a fairly simple situation. Assuming 10,000 customers in the short-term, 100,000 long term, and each customer will be uploading a single file just once every few weeks (typically 400K or so but the file size can vary, sometimes smaller, sometimes larger, but never exceeding 16MB), there will be times when multiple users will be uploading simultaneously. Upload times are random however (whenever the customer wants to do it), so the chances of many simultaneous uploads is relatively small. Realistically you wouldn't expect too many of them in practice (to be doing uploading at very same moment) but anything's possible of course. The app simply needs to read all the records in the file and do a bulk insert into a single table (that's the simple story). I want to read all the records into memory and then insert them in one bulk operation using a single transaction (16 MB at worst - I can always write these records out to a file and import them from there if necessary - the original upload file itself isn't suitable for this since the records first need to be constructed from raw data in the file).
> But assuming you have properly configured hardware and place the log files > on their own Raid 1 or raid 10 it should not be a real problem. But if you > undersize the hardware or don't configure it properly there can definitely > be bottlenecks.
Given my situation, can you just briefly elaborate on this. I'll do my own investigation of course if you can just point me in the right direction (what kind of configuration are you referring to and can SQL Server actually choke if thousands of uploads simultaneously occur one day - note that I'm a very experienced developer but SQL Server isn't my specialty). Thanks again.
> Hi > As Andrew said to separate LOG and DATA files must be e first step
Thanks. Can you also briefly elaborate on my last response to him (just posted). Appreciate your help.
> Have you considered using SSIS Package to insert the data from the files , > it hs very nice feature named 'fast load'
Not familir with this but probably not necessary for my needs unless it solves some problem specificically related to this issue (resource exhaustion). The bulk insert I tested ("SqlBulkCopy" in .NET) is very fast on its own however. A quick test and I could insert 24,000+ records (1 MB) in less than two seconds. Can't get much better than that.
> Thanks. I'm actually using the "SqlBulkCopy" class (assuming you're > familiar with .NET) and this presumably wraps BULK INSERT (still > researching things).
BULK INSERT is a Transact-SQL statement whereas the SqlBulkCopy class (and its unmanaged ODBC/OLEDB cousins) are native client APIs. Unlike BULK INSERT, the client APIs allow you to load directly from program variables.
>> You seem to talk about BCP/BULK INSERT type of load. Look at these >> commands in details. >> Check ROWS_PER_BATCH value; it might help you.
> Thanks. I'm actually using the "SqlBulkCopy" class (assuming you're > familiar with .NET) and this presumably wraps BULK INSERT (still > researching things). I also took a quick look at ROWS_PER_BATCH and this > may prove useful (thanks for pointing this out). If possible however, I'd > like to insert all data in one transaction so it's still unclear to me > whether it's going to be a potential problem.
> BULK INSERT is a Transact-SQL statement whereas the SqlBulkCopy class (and > its unmanaged ODBC/OLEDB cousins) are native client APIs. Unlike BULK > INSERT, the client APIs allow you to load directly from program variables.
Thanks. I am aware of this but was just speculating that "SqlBulkCopy" is probably a wrapper for BULK INSERT or some other high-volume (native) SQL Server statement (possibly internal to MSFT only). My original issue however is whether running "SqlBulkCopy" (or BULK INSERT directly) will have an adverse on performance if each call is wrapped in its own transaction. In particular, 400K to 16MB of data will be inserted at a time, possibly by dozens or potentially hundreds of users simultaneously (each inserting this amount of data but only once per month typically). Will SQL Server (2008) handle this volume and what can I do to mitigate any potential problems.
Mike23 (_no_spam@_no_spam.com) writes: > Thanks. I am aware of this but was just speculating that "SqlBulkCopy" is > probably a wrapper for BULK INSERT or some other high-volume (native) SQL > Server statement (possibly internal to MSFT only).
More or less. BULK INSERT uses OLE DB, and below the covers there are is a command INSERT BULK, which you can see in a Profiler trace.
But keep in mind that while they are similar, they are to some extent different implementations, and there are subtle differences. Sometimes BCP accepts a file that BULK INSERT barfs at, or vice versa.
> My original issue however is whether running "SqlBulkCopy" (or BULK > INSERT directly) will have an adverse on performance if each call is > wrapped in its own transaction. In particular, 400K to 16MB of data will > be inserted at a time, possibly by dozens or potentially hundreds of > users simultaneously (each inserting this amount of data but only once > per month typically). Will SQL Server (2008) handle this volume and what > can I do to mitigate any potential problems.
Whether you have a BEGIN TRANSACTION around does not matter, as long as you don't set ROWS_PER_BATCH to something. The it is one transaction.
Whether SQL 2008 will cope with depends on the rest of the design. If table has only one single index on an IDENTITY column, no sweat. If you have 10 indexes on the table, and you have 100 users at the same time inserting 16 MB data, the game will be rougher.
I would suggest that rather than asking, run a benchmark. You know the table, the indexes and the available hardware. We don't.
-- Erland Sommarskog, SQL Server MVP, esq...@sommarskog.se
The transaction log entries are crucial to peak performance when it comes to writes. Essentially if one insert is held up due to bottlenecks on writing to the log file so will all other transactions that will also be writing at that time. The more concurrent writes you have the more the potential for a bottleneck to occur. Placing the tran log files on their own physical raid array (usually a RAID 1 or RAID 10) helps to ensure these sequential writes are not affected by the mostly random reads and writes of the data files. You didn't mention what the read load will be like and it can certainly have an effect on overall performance. Ensure you have enough memory available to SQL Server to keep the most active rows in memory and minimize any read disk access. Also ensure that when a checkpoint occurs you have enough spindles on the Raid that the data files will reside to keep up with the I/O requests and you should be fine. The bottom line is when you have heavy write activity ensure you separate your data files from your log files onto different physical (not just logical) arrays and use Raid 10 vs. Raid 5 if possible for the data array. Always use Raid 1 or 10 for the log array.
>> Well how much something like this will affect the performance of the >> server depends on many factors and is difficult to say without knowing >> much more about it.
> Thanks for you help and I agree. It's hard to appreciate when you're not > really familiar with the details. In this case however it's a fairly > simple situation. Assuming 10,000 customers in the short-term, 100,000 > long term, and each customer will be uploading a single file just once > every few weeks (typically 400K or so but the file size can vary, > sometimes smaller, sometimes larger, but never exceeding 16MB), there will > be times when multiple users will be uploading simultaneously. Upload > times are random however (whenever the customer wants to do it), so the > chances of many simultaneous uploads is relatively small. Realistically > you wouldn't expect too many of them in practice (to be doing uploading at > very same moment) but anything's possible of course. The app simply needs > to read all the records in the file and do a bulk insert into a single > table (that's the simple story). I want to read all the records into > memory and then insert them in one bulk operation using a single > transaction (16 MB at worst - I can always write these records out to a > file and import them from there if necessary - the original upload file > itself isn't suitable for this since the records first need to be > constructed from raw data in the file).
>> But assuming you have properly configured hardware and place the log >> files on their own Raid 1 or raid 10 it should not be a real problem. But >> if you undersize the hardware or don't configure it properly there can >> definitely be bottlenecks.
> Given my situation, can you just briefly elaborate on this. I'll do my own > investigation of course if you can just point me in the right direction > (what kind of configuration are you referring to and can SQL Server > actually choke if thousands of uploads simultaneously occur one day - note > that I'm a very experienced developer but SQL Server isn't my specialty). > Thanks again.
>> Thanks. I am aware of this but was just speculating that "SqlBulkCopy" is >> probably a wrapper for BULK INSERT or some other high-volume (native) SQL >> Server statement (possibly internal to MSFT only).
> More or less. BULK INSERT uses OLE DB, and below the covers there are > is a command INSERT BULK, which you can see in a Profiler trace.
Ok thanks, though that makes things even more confusing (BULK INSERT and INSERT BULK - two different statements? Is the latter an internal statement?). It makes no sense though that BULK INSERT would use OLE DB, since OLE DB is just the generic COM (Component Objet Model) used to access a database. There should be an OLE DB provider for SQL Server that sits on top of SQL Server but SQL Server itself would know nothing about OLE DB.
> Whether you have a BEGIN TRANSACTION around does not matter,
So compared to using no transaction at all, do transactions consume a lot of resources (not including disk space which isn't an issue in my case). Is it generally harmless to wrap a lot of data in a single transaction IOW (notwithstanding my own situation where many such transactions may be concurrently executing).
> Whether SQL 2008 will cope with depends on the rest of the design. If > table has only one single index on an IDENTITY column, no sweat. If > you have 10 indexes on the table, and you have 100 users at the same > time inserting 16 MB data, the game will be rougher.
> I would suggest that rather than asking, run a benchmark. You know the > table, the indexes and the available hardware. We don't.
Yes, I will be doing that. I only posted in the first place to get some feedback just in case there were any tripwires that may not surface during testing (not to mention advice on what if any precautions I should take, SQL Server settings I might need to set, etc.). Thanks again for your help.
> The transaction log entries are crucial to peak performance when it comes > to writes. Essentially if one insert is held up due to bottlenecks on > writing to the log file so will all other transactions that will also be > writing at that time. The more concurrent writes you have the more the > potential for a bottleneck to occur. Placing the tran log files on their > own physical raid array (usually a RAID 1 or RAID 10) helps to ensure > these sequential writes are not affected by the mostly random reads and > writes of the data files.
I'm assuming you're referring to the ".ldf" file? This is very good to know. I'll look into it. Thanks.
> You didn't mention what the read load will be like and it can certainly > have an effect on overall performance.
Read load will be relatively small. My main concern is about the upload itself and whether it's relatively harmless to wrap several hundred K or more in a single transaction (since I can also do this without having to wrap everything in a transaction - I can use a flag to indicate when the data is completely uploaded but it's unwieldy and difficult to support).
> Also ensure that when a checkpoint occurs you have enough spindles on the > Raid that the data files will reside to keep up with the I/O requests and > you should be fine. The bottom line is when you have heavy write activity > ensure you separate your data files from your log files onto different > physical (not just logical) arrays and use Raid 10 vs. Raid 5 if possible > for the data array. Always use Raid 1 or 10 for the log array.
Again, will look into it. Thanks very much for the info (greatly appreciated)
Mike23 (_no_spam@_no_spam.com) writes: > Ok thanks, though that makes things even more confusing (BULK INSERT and > INSERT BULK - two different statements? Is the latter an internal > statement?).
The latter is internal.
> It makes no sense though that BULK INSERT would use OLE DB, since OLE DB > is just the generic COM (Component Objet Model) used to access a > database. There should be an OLE DB provider for SQL Server that sits on > top of SQL Server but SQL Server itself would know nothing about OLE > DB.
SQL Server knows a lot by OLE DB, since it uses OLE DB internally. It also uses OLE DB to communicate with linked servers.
> So compared to using no transaction at all, do transactions consume a > lot of resources (not including disk space which isn't an issue in my > case). Is it generally harmless to wrap a lot of data in a single > transaction IOW (notwithstanding my own situation where many such > transactions may be concurrently executing).
You cannot compare "no transaction" to "transaction". Any update you do, implies a transaction, even if you do not use BEGIN TRANSACTION. A single INSERT, DELETE, UPDATE, MERGE or DDL operation is a transaction on its own, and can only be carried out in entirety or not at all. If you need to perform a suite of statements and the entire suite must be atomic, you should wrap it all in BEGIN TRANSACTION.
Yes, a longer transactions will take more resources, or more precisely more locks which can block other users. It can also make the transaction log bigger. But if you need a long transaction, you need a long transaction.
-- Erland Sommarskog, SQL Server MVP, esq...@sommarskog.se
> SQL Server knows a lot by OLE DB, since it uses OLE DB internally. It > also uses OLE DB to communicate with linked servers.
I'm not sure what it means however to say that BULK INSERT (or any other SQL statement) uses OLE DB unless it's internally communicating with some other (internal) layer using OLE DB. Why that extra layer is needed however is unclear to me (other than when it has to communicate with some external data source perhaps). At some point though there must be an implementation of the core SQL Server engine and the code for that will know nothing about OLE DB (which is just a set of COM interfaces). I would have thought that layer would be right below the surface itself (for performance reasons alone).
> You cannot compare "no transaction" to "transaction". Any update you > do, implies a transaction, even if you do not use BEGIN TRANSACTION. > A single INSERT, DELETE, UPDATE, MERGE or DDL operation is a transaction > on its own, and can only be carried out in entirety or not at all. > If you need to perform a suite of statements and the entire suite must > be atomic, you should wrap it all in BEGIN TRANSACTION. > Yes, a longer transactions will take more resources, or more precisely > more locks which can block other users. It can also make the transaction > log bigger. But if you need a long transaction, you need a long > transaction.
Ok, thanks (understood). I'll be doing more research into this area (transactions in general). Appreciate your help.
Mike23 (_no_spam@_no_spam.com) writes: > At some point though there must be an implementation of the core SQL > Server engine and the code for that will know nothing about OLE DB > (which is just a set of COM interfaces).
In fact, the description is that the various components of SQL Server uses OLE DB to pass data between them. After all, as you say, OLE DB is just a set of interfaces, so why not?
-- Erland Sommarskog, SQL Server MVP, esq...@sommarskog.se
> In fact, the description is that the various components of SQL Server > uses OLE DB to pass data between them. After all, as you say, OLE DB is > just a set of interfaces, so why not?
It's not that it can't do things that way. A COM component however is designed to *implement* a published set of COM interfaces (OLE DB in this case). Unless it's delegating that work to other COM components internally (through some additional layer that implements these interfaces for it), it makes no sense to be relying on the very same set of interfaces to communicate with itself (since its job is to implement those interfaces). It just strikes me as odd for instance that BULK INSERT (for example) would be invoking some OLE DB interface to carry out its work when its job is to do that work. I would have thought it simply maps to what proprietary MSFT function exists to handle it (some DLL in the bowels of the SQL Server code).