By Tyler Jensen
March 11, 2004 12:00 AM EST
With the corporate wiki you built using the first article in this series (.NETDJ, Vol. 2, issue 1), you have everything you need to create, edit, and navigate content, but now you need something to help users find what they are looking for. In this article I walk you through creating your own search engine for your corporate wiki.
To review, the wiki is a content creation and navigation hypertext system in which all readers are potential editors - a true content democracy. In my previous article I led you through the necessary steps to create your own wiki. While that article gives you the most basic features required in any wiki (parse, edit, and navigate), most wiki implementations provide four other basic features: search, recently edited topics, like or similar topics, and topic history.
This article will focus on the wiki search engine. The other three features are easy to implement, so I will cover them briefly in the accompanying sidebar. The full source code for the complete Corporate Wiki project and the Wiki Index Service is available for download from www.sys-con.com/dotnet/sourcec.cfm.
In my first wiki implementation, I used the SQL Server full-text index service and the FreeTextTable T-SQL keyword in a query of medium complexity. While the Microsoft Search service and full-text indexing on SQL Server work quite well, producing superbly ranked results, there is one obvious drawback to using them in this case. The search service indexes all rows in a table. The WikiTopic table in my wiki contains all versions of every topic, so the search service would index new and old topic revisions without prejudice. Thus, search results would contain both current and obsolete versions of the same topic.
Like Recent History, Dude
In this update to the Corporate wiki project, in addition to search I have added three features that are common to most wiki implementations: like, recent, and history pages. Each of these pages adds a different but quick and easy way for the user to find something in which he or she is interested. The code for each of these is rather simple, so I will leave you to download it for yourself from www.sys-con.com/dotnet/sourcec.cfm.
For short list pages such as these, I like to use a simple HTML table with a runat="server" attribute in the table tag to dynamically fill the table with the table controls in the System.Web.UI.HtmlControls namespace (see Listing 8).
The Like page contains two tables and executes two stored procedures to fill them. One procedure looks for topics that begin with the same word as the existing topic from which the Like link was clicked. The second pulls up those topics that end with the same word as the current topic. For one-word topics, these tables are the same. The tables also display the time and date the topic was last modified.
The Recent page lists recently edited topics in descending order by the date and time edited. This provides users with a simple way to see where the activity is. A simple form lets you change the number of days to go back in the query. The default is 10. The page lists the topic (linked to the view page), the author of the most recent edit, and the date and time the page was edited.
The History page displays each of the major revisions to the given topic in order from most recent to the original. The only trick here is that the link to the most recent topic version is different from the others. The older versions of the topic are linked to the viewversion.aspx page, which is similar to the standard view page but instead of an Edit link, it has a link to the current version of the topic. This is done to prevent the illusion that a user can edit a previous version, which is not possible in the current project and not likely a feature you would care to implement.
The quick and obvious solution to the problem of obsolete topic versions appearing in full-text search results is to move previous versions of a topic to another table when a new version is added, but that comes with its own set of problems. Solving those problems would probably not be too difficult - and if you took that road, no one would blame you. But it's not nearly as much fun as building your own index and search solution - your own wiki search engine.
A year ago I did some research into search engine technology, but let me be the first to disclaim any great level of understanding of this topic or experience in building sophisticated indexing and search routines. There are many off-the-shelf solutions that may be used to provide a similar or even better result. But again, what's the fun in that?
The wiki search engine consists of two primary elements: the WikiIndexer, a Windows Service project; and the wiki query mechanism, an ASP.NET page in the wiki that queries a view in the database. This solution requires the addition of four tables and one view in the database, along with a number of procedures to allow them to be used from the index service and the wiki application. The new tables are WikiWords, WikiWordIndex, WikiTopicRank, and WikiSummary.
The WikiWords table is a unique index of every searchable word in the wiki. The indexer ignores one- and two-letter words, as well as a set of common three-letter words (see Listing 1). The indexer loads this entire table into a hashtable object so that it will only have to write to the table when words are found in the wiki that do not already exist in the table. You might worry that this table would grow too large, given the several hundred thousand words in the English language claimed by most unabridged dictionaries; however, optimistic estimates of commonly used vocabulary are much lower. I found one reference to a study that determined that people of extraordinary intelligence had an average working vocabulary of only 20,000 words. In my own tests of the indexer, I randomly pulled about 50 news stories from the Internet and ended up with just over 4,000 unique words.
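The word filtering and the in-memory word table described above can be sketched roughly as follows. This is an illustrative Python sketch, not the article's C# Listing 1: the stop list, the function names, and the way IDs are assigned are all assumptions made for the example.

```python
import re

# Hypothetical stop list; the article's actual list of common
# three-letter words lives in its Listing 1 and is not reproduced here.
COMMON_THREE_LETTER_WORDS = {"the", "and", "for", "are", "but", "not", "you"}


def searchable_words(text):
    """Return lowercase words of three or more letters, minus the stop list."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words
            if len(w) >= 3 and w not in COMMON_THREE_LETTER_WORDS]


def register_words(words, word_ids):
    """Add unseen words to the in-memory table and return only the new ones.

    The new words are the only ones that would need a database write.
    """
    new_words = []
    for word in words:
        if word not in word_ids:
            # Stand-in for an identity value assigned by the database.
            word_ids[word] = len(word_ids) + 1
            new_words.append(word)
    return new_words


word_ids = {"wiki": 1}  # pretend this was loaded from WikiWords at startup
found = searchable_words("The wiki is an odd but useful tool for a team")
new = register_words(found, word_ids)
```

Here `found` holds five searchable words, but only the four not already in `word_ids` end up in `new`, so "wiki" costs no write at all.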
The WikiWordIndex table is a cross-index table containing only three integer columns. The first two are foreign keys into the WikiWords and WikiTopic tables. The third is a count column that holds the number of times a given word occurs in a given topic.
The Search stored procedure pulls from the WikiSearch view with a specified weight given to the rank of a given topic and to the number of times the words in the search phrase occur in the topic. The WikiTopicRank table contains a rank value for each current topic in the wiki. Previous versions are removed from the index tables upon completion of the indexing routines. The topic rank is simply a count of how many times other wiki topics link to the topic in question. This gives you an indication of how popular this topic is in the wiki.
The WikiSummary table is a simple repository that keeps track of when a topic was last indexed and stores a summary of the topic. Currently, the summary simply takes the first 256 characters in a topic. In future versions of your wiki, you may wish to enhance the index service with a more sophisticated summarizing routine.
The WikiSearch view draws data from the three tables just described, as well as the WikiTopic table. This view is used by the Search stored procedure, which has an inner select on the WikiWords table, so each of the tables in the database is used to accomplish the search. With a smaller content dataset this query is quite fast, but as time goes on and your content grows, you will probably want to examine the efficiency of the Search stored procedure and make adjustments to the indexes to improve its performance.
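To make the weighting idea concrete, here is a minimal Python sketch that uses an in-memory SQLite database as a stand-in for SQL Server. The column names, the sample rows, the two weight parameters, and the scoring formula (weighted rank plus weighted word count) are assumptions for illustration; the article's actual view and stored procedure are in the downloadable source.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE WikiWords     (WordId INTEGER PRIMARY KEY, Word TEXT UNIQUE);
CREATE TABLE WikiWordIndex (WordId INTEGER, TopicId INTEGER, WordCount INTEGER);
CREATE TABLE WikiTopicRank (TopicId INTEGER PRIMARY KEY, TopicRank INTEGER);
CREATE TABLE WikiSummary   (TopicId INTEGER PRIMARY KEY, Summary TEXT);
INSERT INTO WikiWords     VALUES (1, 'search'), (2, 'engine');
INSERT INTO WikiWordIndex VALUES (1, 10, 3), (2, 10, 1), (1, 20, 1);
INSERT INTO WikiTopicRank VALUES (10, 2), (20, 9);
INSERT INTO WikiSummary   VALUES (10, 'Building the search engine'),
                                 (20, 'Home page');
""")


def search(conn, terms, rank_weight=1.0, count_weight=1.0):
    """Return (TopicId, Summary, Score) rows, best score first."""
    if not terms:
        return []
    placeholders = ",".join("?" for _ in terms)
    sql = f"""
        SELECT s.TopicId, s.Summary,
               ? * COALESCE(r.TopicRank, 0) + ? * SUM(i.WordCount) AS Score
        FROM WikiWordIndex i
        JOIN WikiSummary s ON s.TopicId = i.TopicId
        LEFT JOIN WikiTopicRank r ON r.TopicId = i.TopicId
        WHERE i.WordId IN (SELECT WordId FROM WikiWords
                           WHERE Word IN ({placeholders}))
        GROUP BY s.TopicId, s.Summary, r.TopicRank
        ORDER BY Score DESC
    """
    return conn.execute(sql, [rank_weight, count_weight, *terms]).fetchall()


by_popularity = search(conn, ["search", "engine"])
by_word_count = search(conn, ["search", "engine"], rank_weight=0.1)
```

With equal weights, the heavily linked topic 20 comes first even though it mentions the words only once. Lowering `rank_weight` lets topic 10, which uses the search words more often, move to the top.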
The indexing process is relatively simple. First I load a list of unique topic names. Then I iterate through each topic, retrieving the most recent version of each, and index the text contained in the topic. If I find new words, I add them to the words hashtable and the WikiWords table. I then pull the WikiWordIndex records for this topic into a hashtable so that I can find out whether each word in the topic has been previously indexed. It also allows me to compare the number of times that word occurred previously - if at all - with the number of times it occurs in the topic now. If the word did not exist previously, a new record is written to the WikiWordIndex table (see Listing 2). If the number of occurrences in a topic changes for a previously indexed word, the record is updated. If nothing changes, no write to the database occurs.
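The compare-before-write step for a single topic might look like the following Python sketch. It is not the article's Listing 2; the function name and the dictionary shapes are hypothetical, chosen only to show how unchanged words avoid a database write.

```python
from collections import Counter


def plan_index_writes(previous_counts, topic_words):
    """Compare stored word counts with fresh counts for one topic.

    previous_counts: word -> count already stored in WikiWordIndex.
    topic_words: searchable words found in the topic's current text.
    Returns (inserts, updates); unchanged words appear in neither.
    """
    current = Counter(topic_words)
    inserts = {w: n for w, n in current.items() if w not in previous_counts}
    updates = {w: n for w, n in current.items()
               if w in previous_counts and previous_counts[w] != n}
    return inserts, updates


previous = {"wiki": 2, "search": 1}
inserts, updates = plan_index_writes(
    previous, ["wiki", "wiki", "search", "search", "index"])
```

In this example "index" is new and gets an insert, "search" went from one occurrence to two and gets an update, and "wiki" is left alone.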
Once the words are counted and accounted for, I write the summary of the topic to the WikiSummary table. Then the fun begins. I used code from the wiki parser in the wiki ASP.NET application to find the wiki topic links in the topic being indexed at the moment (see Listing 3). When a link to another topic is found, I update the linked topic's hashtable count value to be used once I have looped through each current topic.
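A rough Python sketch of the link-counting pass follows. It assumes topic links are CamelCase words such as HomePage, which is a common wiki convention but may not match the parser from the first article, and it counts each linking topic once per target, which is one reading of the rank described above. The regular expression and the function name are hypothetical and do not come from Listing 3.

```python
import re
from collections import Counter

# Assumption: a topic link is a CamelCase word with at least two "humps".
WIKI_LINK = re.compile(r"\b(?:[A-Z][a-z]+){2,}\b")


def count_inbound_links(topics):
    """topics maps topic name -> current text.

    Returns a Counter of topic name -> number of other topics linking to it.
    """
    ranks = Counter()
    for name, text in topics.items():
        # Ignore self-links and count each target once per linking topic.
        linked = set(WIKI_LINK.findall(text)) - {name}
        for target in linked:
            if target in topics:
                ranks[target] += 1
    return ranks


topics = {
    "HomePage": "Welcome. See SearchEngine and RecentChanges.",
    "SearchEngine": "Back to HomePage. SearchEngine indexes topics.",
    "RecentChanges": "Lists edits; see HomePage and SearchEngine.",
}
ranks = count_inbound_links(topics)
```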
Now that I have completed my loop through the current topics, I turn my attention to the topic ranks. I pull a list of existing topic ranks into another hashtable and compare the ranks that already exist with those that have been counted during this run of the indexer (see Listing 4). If the rank has been changed, an update is run on the WikiTopicRank table. If the rank found during indexing does not exist, then a new record is written to the table.
The ranking has been updated now, but there may be topics in the index that no longer belong because they have been updated with major revisions since the indexer last updated the index. The cleanup routine loads the existing topic IDs into a hashtable and compares them with the IDs of current topics just indexed. If the topic ID in the existing indexed topic hashtable does not exist in the most recently indexed topic hashtable, then rows referencing that topic ID are deleted from the index tables (see Listing 5).
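The cleanup comparison amounts to a set difference, as in this Python sketch. The TopicId column name and the list of tables to purge are assumptions based on the tables described earlier, not a copy of Listing 5.

```python
def stale_topic_ids(indexed_ids, current_ids):
    """Topic IDs still in the index that were not indexed in this run."""
    return sorted(set(indexed_ids) - set(current_ids))


def cleanup_statements(stale_ids,
                       tables=("WikiWordIndex", "WikiTopicRank", "WikiSummary")):
    """Build one parameterized DELETE per stale topic and index table."""
    return [(f"DELETE FROM {table} WHERE TopicId = ?", (topic_id,))
            for topic_id in stale_ids
            for table in tables]


stale = stale_topic_ids(indexed_ids={10, 20, 30}, current_ids={20, 30, 40})
statements = cleanup_statements(stale)
```

Topic 10 was indexed on an earlier run but is no longer a current version, so it is the only ID scheduled for deletion. Topic 40 is new and is handled by the indexing pass, not by cleanup.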
The WikiIndexer service is easy to install using the .NET Framework's InstallUtil.exe utility. Once you compile the indexer, just copy the contents of the bin directory to a nice permanent home. Open a command prompt, change directories to your permanent home, and execute the install utility. You can find many tutorials on how to create a Windows Service using VS.NET, so I won't waste your time here with that. As long as the service has access to the database, it doesn't really matter where you run it. The XML config file contains the two values you need to configure in order for the service to run properly. The first is the connection string and the second is the number of hours between each run of the indexer. You may wish to create a more sophisticated scheduling mechanism for running your indexer, but that's outside the scope of this article.
Implementing the ASP.NET search page in the wiki is easy enough. The new search.aspx page consists of a simple form to take the user's search query with a Search button. The button click handler prepares and validates the search terms and then executes the search on the database using the WordSearch stored procedure (see Listing 6). The results of the query are bound to a simple ASP.NET Repeater control that contains a hyperlink and literal control for the link and the summary of each search result.
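The preparation and validation of the search terms could be sketched in Python like this. The rules shown (letters only, a three-letter minimum to mirror the indexer, duplicate removal, and a cap of ten terms) are assumptions for illustration and are not taken from Listing 6.

```python
import re

MAX_TERMS = 10  # hypothetical cap on the number of terms sent to the database


def prepare_search_terms(query):
    """Normalize a raw query into a list of unique, indexable terms.

    Raises ValueError when nothing searchable remains, so the page can
    show a message instead of running an empty search.
    """
    terms = []
    for word in re.findall(r"[a-z]+", query.lower()):
        # Words shorter than three letters are never indexed, so skip them.
        if len(word) >= 3 and word not in terms:
            terms.append(word)
    if not terms:
        raise ValueError("Enter at least one word of three or more letters.")
    return terms[:MAX_TERMS]


terms = prepare_search_terms("Search WIKI, search an index!!")
```

Because the terms are extracted rather than passed through verbatim, punctuation in the user's input never reaches the stored procedure call.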
In the ItemDataBound event of the Repeater, I create the link and, using the wiki parser, I parse the summary into the literal control (see Listing 7). This can push the results page to a fairly long scroll so you may wish to consider changing how you display your search results. To keep things simple, I did not add a paging mechanism to the search page, nor did I apply special formatting to the topic link, which means you get it as it exists in the database.
Conclusion
Now you have a good general-purpose wiki. In Part 3 of this series, I will add some of the corporate-like features your users in the workplace will undoubtedly want, including the ability to add a file repository to your wiki that will let users upload files to the wiki and an expanded parser that will allow users to reference those files in their topics for download or, in the case of GIFs and JPGs, for display within the topic itself. Until then, enjoy building your own search engine.
Copyright © 2004 SYS-CON Media, Inc. — All Rights Reserved.
More Stories By Tyler Jensen
Engrossed in enterprise application architecture and development for over ten years, Tyler Jensen is a senior technical consultant in a large health intelligence company, designing and developing claims processing and analysis software. In his spare time he does a little writing and outside consulting.