YOUR FEEDBACK
johnpetersen wrote: Great post. You hit some good points, and hopefully me sending this post. It wil...

SYS-CON.TV
TOP MICROSOFT .NET LINKS


Achieving Higher Throughput with SQL Anywhere
Focus on the performance requirement

Six years ago I was asked to implement a data repository to hold network management and performance data. With monitoring systems, databases have a tendency to be large, and have a delicate balance between inserting (logging) new data and deleting old (aging) data out from the system. The SQL Anywhere server database from iAnywhere was what the customer wanted to use, so I first had to evaluate if it could handle the expected load.



It was initially estimated that the solution would need to support 20 to 30 gigabyte databases, where data collection was ongoing and older data was deleted on a rolling window of 30 days. No problem – one gigabyte of data a day in, one gigabyte of data a day out. I easily validated that SQL Anywhere could handle the load. Over the next year, performance requirements doubled, then redoubled again. Currently, I now support many 100-plus gigabyte databases, and have even exceeded 170 gigabytes in a single database server using SQL Anywhere 7.0.4 and 8.0.2. For the purpose of this article, I’ve been running all tests on version 9.0.2 of SQL Anywhere.

In high throughput systems, the database schema will consist of many types of tables, ranging from configuration data, user data, and tables storing the high throughput data. These tables are usually tracking rapidly changing information over time, such as statistics, usage measurements, and changing conditions. Temperature changes from a few thousand sensors, fluid flow information on a gas pipeline, or network traffic statistics come to mind. This article concentrates on data throughput for inserts and deletes, and how to get the most out of your database server.

While there are many different variables that impact throughput performance, this paper focuses on database schema design and tuning the database server with respect to these high throughput tables. This article does not address performance issues related to application architecture or hardware selection. 
 
Database Design – Focus on Performance
When high throughput is required, every decision needs to be thought of with respect to how it will affect performance. Databases are generally very fast at inserting and deleting data; however, as soon as you move away from this core functionality and start using all the great “features” of relational databases, you can easily hamper performance.

In other words, keep these high-performance tables simple. Some good guidelines include:

  1. Keep table width to a minimum: If you want to be fast, you can’t have everything. This will allow you to fit more records onto a database page and use fewer pages overall. Fewer pages means faster throughput.
  2. Use surrogate primary keys with autoincrement: This allows for faster indexing, and in cases where there is a rolling window of time-stamped data, your deletes can be keybased rather than time based, or worse yet multi-field based.
  3. Triggers should be avoided at all cost: Calling a stored procedure when inserting one record every minute or so is not very expensive. We’re shooting for inserting hundreds of rows or more per second.
  4. Avoid constraints that can be moved into middleware: These include not null constraints, default values, and boundary constraints.
 
In systems where you are constantly collecting data, the more you can limit the access of the information, the higher the rate of performance. The savings here are from using a limited set of indexes and foreign keys and educating developers on what data is available and how to access it. If you need only two ways of accessing the dataset and you are supporting seven indexes on a table just in case someone accesses it, you are unnecessarily hurting throughput. The removal of Foreign Key constraints is dangerous, but if you are including the constraint for “completeness” and never actually using the constraint, then it may be a candidate for deletion.  
 
Indexing
Application developers and the database designer responsible for maintaining high performance are usually at odds when it comes to indexing. The number and size of indexes directly affect the insertion rate of a table. When the need for massive throughput exists, the bare minimum of indexes should be used. I have gone as far as to restrict areas of application design in order to maintain a minimum set of indexes. This is a tricky trade-off since adding an application feature that represents 1% of the application could affect 50% of the application when it comes to performance. In the worst case it can make the entire application unusable.
 
A Throughput Example
The example schema below is valid, but contains several of the constructs that can hinder throughput, such as no autoincrement key, column constraints, extra indexes, and extra fields. I will use this to run performance tests of 100,000 records and make corrections on each test run. Each test changes only one aspect of the given design, and the final test applies all the changes. Tests were run an IBM desktop running Windows 2000 with a single Pentium 4 processor and 1.5 Gigabytes of RAM. The disk is a Seagate Barracuda 7200rpm IDE.
  
Results are given in elapsed time for each test. The key is the difference between each test and the baseline test, and not the elapsed time. As with anything, the actual elapsed time can be impacted by a wide range of variables, such as hardware, software, application design and, of course, what we’re most interested in, schema design. 
 
create table SENSOR
(
    SENSOR_ID                       integer                        not null default autoincrement,
    SENSOR_LOCATION      varchar(128),
    LONGITUDE                      varchar(16),
    LATTITUDE                       varchar(16),
    primary key (SENSOR_ID)
);
 
create table TEMPERATURE_DATA
(
    TEMP_TYPE_CODE          varchar(10)     not null,
    SENSOR_ID                       integer              not null,
    COLLECTION_TIME        timestamp                     not null,
    TEMPERATURE                 float,
    HUMIDITY                         float,
    TEMP_STATUS                 varchar(8)                   
         check (TEMP_STATUS is null or ( TEMP_STATUS in ('VALID','INVALID') )),
    MEASUREMENT_TYPE                           varchar(1)                    default 'F'
         check (MEASUREMENT_TYPE is null or ( MEASUREMENT_TYPE in ('F','C') )),
    VAL_1                                float,
    VAL_2                                float,
    VAL_3                                float,
    VAL_4                                float,
    VAL_5                                float,
    VAL_6                                float,
    VAL_7                                float,
    VAL_8                                float,
    VAL_9                                float,
    VAL_10                              float,
    primary key (TEMP_TYPE_CODE, SENSOR_ID, COLLECTION_TIME)
);
 
create table TEMPERATURE_TYPE
(
    TEMP_TYPE_CODE          varchar(10)                  not null,
    TEMP_TYPE_NAME         varchar(10),
    TEMP_TYPE_DISPLAY                            varchar(10),
    primary key (TEMP_TYPE_CODE)
);
 


 
Key things to note from the numbers in Table 1:
  • Often people think there is one fix that will cure all their performance issues, but typically you’ll need to find many small gains. Each of the tests show modest improvements on the insert rates; however, taken altogether we achieved roughly a 33 second or 35% improvement for insertions. Restricting table width and removing column constraints were the two biggest individual gains. Test 4, removing two indexes, showed almost no improvement in insert rates during the tests. In real-life examples I have experienced up to a 5% loss in performance by adding an index. Testing is the only way to truly determine the cost of adding an index.
  • Test 3 showed drastic improvement when deleting data by moving to an autoincrement primary key rather than deleting by timestamp. Without indexes and using the autoincrement, Test 7 showed an even more drastic increase.
SQL Anywhere Options
There are a few options that can be set in the server to further improve throughput, however, you need to consider the importance of your data. Since high throughput systems tend to have fairly benign data, you can be a little more cavalier in commit/rollback settings.
 
Setting the options Delayed_Commits to “On” and Cooperative_Commit to “Off” allows you to streamline throughput to the database. By changing these settings, you are allowing the application to continue processing instead of waiting to find out that the data absolutely got committed. However, you could lose data in the event of a system or hardware failure. Since high throughput data is often expendable, and any catastrophic event like a system or hardware failure typically means I’m going to lose data anyway, I typically change these for the database connections responsible for high-performance tables.
 
Rerunning Test 7 with Delayed_Commits to “On” and Cooperative_Commit to “Off,” the elapsed time drops from 95.5 seconds to 60 seconds, representing an additional 59% gain. 
 
Some Physical Considerations
Most performance problems usually lie in the application design, SQL, and database design. However, the physical choices you make can have a dramatic affect on performance. This area is deserving of a white paper, but a few basic tips include:

  1. A fancy $100,000 server with slow disk throughput can be beaten in performance tests by a $500 system with a $100 IDE disk drive. A big Sun workstation ships with internal disks that typically run at 40Mb per second throughput. For a few hundred dollars, I can easily get a SCSI setup running at 160Mb or 320Mb per second on a cheap personal computer.
     
  2. Most disks are physically formatted at 4096 block sizes – so use a 4096 database page size. Running Test 7 on a database using a 1024 page size, the elapsed time increased from 95.5 seconds to 100 seconds, which is a 5% decrease in performance.
  3. The faster the disks RPM rate, the faster you can write data.
  4. Only use hardware RAID controllers. Software RAID performs horribly.

In my experience, SQL Anywhere on Windows runs faster than Linux, which runs faster than Solaris.
Memory is cheap, so buy a lot. This will not directly help insert and delete rates, but when you start querying, especially with “order by” or “group by” clauses, you can easily force the database server to use Temp space. Temp space is a file on the disk and we want all disk usage to be for inserting and deleting and not to assist selects.
 
Backup, Transaction Logs, and Recovery
Recovery can become a major issue with large databases when not done properly. A power outage causing a restart of the server, for example, will automatically cause a SQL Anywhere server to go into an automatic recovery. In certain instances, I’ve seen the recovery process go on for several days in 7.0.4 and 8.0.2. The database can become inaccessible. Some users have gone as far as to find the dbsrv process and kill it manually. Let’s think about this. The database was killed by accident, which is causing it to repair itself, and their solution was to kill it again, just to make sure that it stands even less chance of automatically recovering?  Doesn’t make a whole lot of sense.

As with any large database, the backup and recovery process is only as good as the plan put together by the database administrator. This plan needs to be based on the individual customer’s needs. If a 10-minute recovery is required with minimal loss of data, you’ll have a different plan than someone who can allow 10 hours maximum downtime and to whom data loss is unimportant. A few things to consider:

  • Use a RAID setup that supports mirroring. Some people think that mirroring setups are costly since they require two sets of disks. In 1980, I would have understood this reluctance to purchase a few extra 40 Gig drives, but today they are pretty cheap. Even my iPod has 60 gigabytes of space.
  • Run an incremental backup strategy. This will conserve space and it will give you better point-in-time recovery.
  • In a recovery situation that goes over a preset time limit, immediately go to a backup rather than wait for the production server to fully recover. Yes, you will lose some data, but you will be up and running quickly.
  • For even better availability, run a redundant server that can be brought up while the production server is in recovery.
 
Benchmark Testing
All the topics discussed above are meaningless unless you can prove that you are helping performance. Given the number of hardware configurations, database designs, application designs, and problems to be solved, the only way you will know how your system performs is via testing.

While developing tools for NASA’s Space Network Test division, each version of the software I was developing was subjected to the “Benchmark Test.” Our team was always nervous during these tests since not meeting the testers’ basic benchmarks for functionality and performance meant immediate failure of our software. Twenty years later, I still use my own benchmark testing with every database architecture I develop.

My SQL Anywhere Tester is very simple and uses a Java driver and jConnect. It times inserts, selects, updates, and deletes. Any time I consider a change to the database server, operating system, or application architecture in the name of performance, I run the benchmark tester, make the change, and run it again in order to see if the change made any difference. I then test all other platforms to ensure it runs consistently on all machines.
 
Conclusion
With proper planning, design, testing, and some self-restraint, SQL Anywhere makes for an excellent database in applications that require high throughput. The key is remaining focused on the performance requirement and working with all developers such that they understand what constraints exist. Keep things simple and control the running environment as much as possible, and you can consistently hit the high throughput numbers you need.
 
About Todd Loomis
Todd Loomis is an independent consultant who specializes in database design and development frameworks. He has designed systems in many areas to include the aerospace, legal, financial, energy, and computer networking industries. Todd has been developing and supporting products using SQL Anywhere as an embedded database for 6 years. A performance-testing tool is available for free at his Website.

YOUR FEEDBACK
Todd Loomis wrote: The test code I used for the tests discussed in this article, as well as other articles and information is available at my website: www.DatabaseGunForHire.com. Thanks for reading. Todd Loomis
MICROSOFT .NET LATEST STORIES
In a move that looks tailor-made for an antitrust suit, Microsoft says it’s going to give away a consumer security kit that it’s building code named Morro. It should be available in the second half of next year – probably more like mid-year. The freebie widgetry is supposed to de...
OpenSpan and TIBCO have announced a technology and business partnership designed to extend TIBCO solutions to desktop environments. The partnership will enable TIBCO Service-Oriented Architecture, Business Process Management and Business Optimization solutions to more rapidly integrate...
Tidal Software has announced Intersperse 8.0, a product that monitors J2EE and .NET applications and their transaction component performance to produce meaningful metrics for managing applications and high-level business processes. The product leverages a combination of lightweight Ja...
DataGuise has announced their first masking in place solution for multi-database environments such as Oracle, Microsoft SQL Server, and others. The dgSolution Suite provides secure masking of database content and is designed for the highest level of flexibility and functionality across...
The BlackBerryR Technical Webcast Series is designed to help BlackBerry administrators better manage and leverage the capabilities of their BlackBerry solution. Each webcast is packed with detailed technical information, covering topics that are relevant to you. Our on-demand webcasts ...
SUBSCRIBE TO THE WORLD'S MOST POWERFUL NEWSLETTERS
SUBSCRIBE TO OUR RSS FEEDS & GET YOUR SYS-CON NEWS LIVE!
Click to Add our RSS Feeds to the Service of Your Choice:
Google Reader or Homepage Add to My Yahoo! Subscribe with Bloglines Subscribe in NewsGator Online
myFeedster Add to My AOL Subscribe in Rojo Add 'Hugg' to Newsburst from CNET News.com Kinja Digest View Additional SYS-CON Feeds
Publish Your Article! Please send it to editorial(at)sys-con.com!

Advertise on this site! Contact advertising(at)sys-con.com! 201 802-3021


SYS-CON FEATURED WHITEPAPERS

ADS BY GOOGLE
BREAKING NEWS FROM THE WIRES
Collexis Holdings, Inc. (OTC Bulletin Board: CLXS), a leading developer of semantic search and knowl...