Once you begin using SQL, it’s very natural to just put every data object you have into your database. However, one of the important lessons of SQL is that not all data needs to be “relational”; just because SQL Server can handle virtually any data type doesn’t mean it’s the ideal solution for those data types. This article is about what happens when you need to start storing large volumes of images, text, XML documents, JSON files, or other documents.
The Natural Tendency to “Grow”
Generally, a server-based data project begins by identifying a need for a data storage system and adding a data repository. When you start, you’re free to choose between SQL Server, MySQL, CouchDB, and whatever else is trendy. You start by looking at all the data you intend to use, and you identify the correct solution for storing that data. For the first few months of the project, all is well, performance is good, and you’re using your data store exactly as you intended.
While your project continues to be a success, you’ll keep adding features. Over the next few months you’ll add new data types, new secondary tables, affiliated data feeds, and so on. This phase of your project differs from the beginning: no single feature by itself is large enough to justify a radical rewrite. Even if you have enough time to consider complex solutions that mix SQL and NoSQL databases, it’s not a good idea to give the IT management team too many disparate systems to supervise. You shouldn’t be optimizing just your code; you should be optimizing the systems management requirements and complexity as well.
I often visit companies that have huge SQL databases and ask them why those databases were built in that particular fashion. One company might have a terabyte-sized database filled with image files but only a few hundred MB of actual relational data; another firm might have a hundred gigabytes of HTML text stored in a custom content management system. In most cases the companies will tell you that they simply started with the database and kept growing.
Of course, on the other side of the fence are the companies that continuously change their database architecture. These firms are able to rapidly throw out an entire data store when they think the data storage system is sub-optimal, then switch back when circumstances change again. If you have the time available, doing these kinds of refactoring projects can be excellent learning opportunities; but for the rest of us, I have some tips for managing large data growth.
Storing Large Blobs in SQL
Your SQL database has tons of great features for large data support. Given that most software needs to store text, comments, or XML / JSON documents for rapid retrieval, let’s begin with large text. SQL Server’s older large-object types – text, ntext, and image – are deprecated now, so let’s focus on varchar(max) and nvarchar(max).
Some of you may recall my previous article where I mentioned that removing varchar() columns was a useful way to improve performance. This is still correct: the highest-performance tables are the ones that omit all variable-length columns and pack their rows extremely densely. Density is the reason you want to take all your variable-length text and split it off into a separate system.
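As a sketch of that split (the table and column names here are hypothetical), the hot table stays fixed-width while the text moves to a side table keyed by the same ID:

```sql
-- Dense, fixed-width table: every row is the same size, so pages pack tightly.
CREATE TABLE game_scores (
    game_id     INT NOT NULL PRIMARY KEY,
    score       INT NOT NULL,
    updated     datetime NOT NULL
)

-- Variable-length text lives in its own table, joined only when actually needed.
CREATE TABLE game_score_notes (
    game_id     INT NOT NULL PRIMARY KEY,
    notes       nvarchar(MAX) NULL
)
```

Queries that scan scores never touch the notes table, so the dense table keeps its full scan speed.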
As an example, I once received a report that the latest minor point release of our codebase had a terrible performance regression bug. The point release didn’t contain a lot of significant changes, but the database nearly ground to a halt whenever the system retrieved data about a game. Naturally, the first thing you should do is to compare performance before and after the change. I identified one core stored procedure, “get_GameData”, which was the culprit; looking at my application’s debug logs I would see the following:
```
(old version) SQL: get_GameData (Time: 00:00:00.0020000)
(new version) SQL: get_GameData (Time: 00:00:00.0060000)
```
The next step was to trace the query’s execution plan. It was a compound query that touched about 10 different tables; all of those tables were fast except for one: the table that listed the game’s release date. Why would that be a performance drain? I decided to take a look:
```sql
CREATE TABLE game_release_dates (
    release_date_id INT NOT NULL PRIMARY KEY IDENTITY(1,1),
    game_id INT NOT NULL,
    release_date datetime NULL,
    found_by_researcher_id INT NOT NULL,
    researcher_comments text NULL -- New column added in this point release
)
```
Performance problem spotted! Someone had slipped a minor feature into the software to track user comments. However, this “researcher_comments” field was sparse: it was generally null but in a few dozen instances it contained a “copy-paste” of the contents of a publisher press release verifying the release date of the game. Because this table was touched in hundreds of areas, it contributed to a general slowdown.
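As an illustration (assuming the schema above), a quick way to confirm that a text column really is sparse is to compare the non-null count and average size against the total row count:

```sql
-- How many rows actually carry a comment, and how large is a typical comment?
SELECT COUNT(*)                             AS total_rows,
       COUNT(researcher_comments)           AS rows_with_comments,
       AVG(DATALENGTH(researcher_comments)) AS avg_comment_bytes
FROM game_release_dates
```

If rows_with_comments is a tiny fraction of total_rows, every other consumer of the table is paying for data it never reads.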
Use Case: User-Generated Comments
The reason this performance problem existed is that the data was sparse and inefficiently stored. There’s nothing wrong with using varchar(max) for comments, but of all the dozens of systems that needed to use the release dates table, only one of them – the researcher user interface – needed to see the comments. One option at this point is to migrate all the comments to a completely separate system outside SQL Server; that’s certainly possible, but here’s an alternative pattern, the comments table:
```sql
CREATE TABLE user_comments (
    comment_id INT NOT NULL PRIMARY KEY IDENTITY(1,1),
    -- This is an ID number of the "table" being commented on - e.g. 153 might refer to "release dates"
    table_id INT NOT NULL,
    -- This is the ID number of the "object" being commented on - e.g. game #124721387
    record_id INT NOT NULL,
    -- This is the User ID writing the comment. Maybe you want to use an Active Directory account name?
    user_id INT NOT NULL,
    comment_date datetime NOT NULL,
    notes nvarchar(MAX)
)
```
This is a pattern I’ve been using for about twenty years now. It’s simple, useful, and it allows the comments table to be extremely dense, even though comments may be distributed sparsely. If you only have a few dozen comments, and those comments are only used in a few rare cases, you’ve basically eliminated the overhead for all other systems that don’t require them.
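Retrieval with this pattern is a single indexed lookup. Assuming 153 is the (hypothetical) ID assigned to the release dates table, the researcher UI might fetch comments like this:

```sql
-- Fetch all comments attached to one release-date record.
SELECT user_id, comment_date, notes
FROM user_comments
WHERE table_id = 153              -- the "release dates" table
  AND record_id = @release_date_id
ORDER BY comment_date
```

An index on (table_id, record_id) keeps this lookup cheap no matter how many other tables share the comments store.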
Use Case: Document Retrieval
Another area where SQL is often tasked to help out is document retrieval. You may have a large collection of XML documents – perhaps they represent game turns, or content management system text, or zone data files. If you already have SQL Server in place it’s likely more efficient to use SQL Server to index these documents rather than to add a secondary system dedicated to key-value-pair document retrieval.
That advantage is likely to be squandered, though, if you actually store the documents directly in SQL Server. Your database server is powerful only as long as it has free RAM available to cache high-frequency data; every gigabyte of blob data competes for that cache, and once your working set no longer fits in memory, reads spill to disk and everything slows down. That’s the bottleneck problem.
The critical performance improvement you can make is to store big objects elsewhere and index them in your database. We can use secondary data stores for anything that’s going to be 1) large, 2) unstructured, and 3) non-relational. Ideally, for large objects like these – especially images, XML, JSON, and so on – you want to cut out all your internal overhead. You can store this data directly on a high-performance filesystem and identify each document with a simple URL scheme.
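For illustration only (the hostname and paths here are hypothetical), such a scheme might map each integer ID directly to a path:

```
https://cdn.example.com/documents/{item_id}.xml
https://cdn.example.com/thumbnails/{item_id}.jpg
```

The database stores only the integer; the path is computed, never stored.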
This way, you only need to retrieve the integer item ID from the database, and you can tell the client application how to retrieve that object directly. Some people prefer GUIDs as identifiers, but at 128 bits they take twice the space of a 64-bit int – 50% less storage efficient. Unless I have a specific need for GUIDs, I tend to go with 64-bit ints.
| Type of Data | Usage Pattern | Recommended Approach |
| --- | --- | --- |
| Item thumbnails | Rarely changed, but accessed continuously | Use a CDN, like Akamai, AT&T, or EdgeCast. These are the most effective rapid delivery mechanisms for small files. |
| User created images such as character images, icons, guild logos, etc. | Changed frequently, accessed regularly | this guy’s solution. |
| XML or JSON data for multiplayer games | Changed dozens of times over the duration of a game, accessed once for each revision | Grant your front-end web servers access to a shared redundant data store like Amazon S3, and enable reads and writes to that location. Store each file using a GUID filename. Create a back-end pruning task to eliminate data when it exceeds your age threshold. |
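The pruning task mentioned for game data can be a simple scheduled job. As a sketch (the index table `game_documents`, its `created_date` column, and the 30-day threshold are all hypothetical):

```sql
-- Scheduled cleanup: drop index rows for game documents past the age threshold.
-- The matching S3 objects would be removed by a corresponding lifecycle rule.
DELETE FROM game_documents
WHERE created_date < DATEADD(DAY, -30, GETDATE())
```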
A schema for a relational SQL table that links to a back-end secondary data store should look something like this:
```sql
CREATE TABLE items (
    item_id INT NOT NULL PRIMARY KEY IDENTITY(1, 1),
    item_name nvarchar(100) NULL,
    item_class_id INT,
    item_thumbnail_id BIGINT, -- or GUID
    item_fullimage_id BIGINT  -- or GUID
)
```
With this pattern, your database server keeps working on the small subset of your data that must be relational, and leaves the high-volume bit pushing to the content delivery network. When it comes time to link to an item thumbnail or full image, you just generate the URL and retrieve the object from the CDN. Alternatively, if you’re writing a client-side application, write a startup script that syncs a local cache against the images on your CDN in one pass.
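URL generation can happen in the application or in the database itself. A T-SQL sketch against the items table above (the CDN hostname is made up for illustration):

```sql
-- Build the CDN URL for an item's thumbnail from the index row.
SELECT item_id,
       item_name,
       'https://cdn.example.com/thumbnails/'
           + CAST(item_thumbnail_id AS varchar(20)) + '.jpg' AS thumbnail_url
FROM items
WHERE item_id = @item_id
```

Because the URL is derived from the ID, no string column needs to be stored or kept in sync.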
Where do you go next?
Pretty much all the tricks I’ve shared today are related to moving data aside so it doesn’t impact critical high performance systems. When you focus on building extremely dense data tables, SQL Server can become almost as fast as constructing your own B-Tree in memory (because that’s actually what SQL Server does behind the scenes). As long as you keep in mind the side effects of your design choices, you should be able to continue to tune your database for extremely high performance.