Add a Spider to the Web
By: Jeff Heaton
Spiders are used for many purposes on the Internet. Search engines use spiders to locate Web pages for their databases. Companies use spiders to monitor their competitors' Web sites and track changes. Individual users use spiders to download the contents of Web pages for later offline viewing. Developers use spiders to scan their own Web sites looking for broken links and other issues.

But what exactly is a spider? A spider is a semi-autonomous program that travels Web links the way a real spider travels the strands of its web. It is semi-autonomous in that it is given an initial Web page to start from, but where it goes next is determined as it runs. The spider scans the starting page for links and later visits those links. Each linked page is in turn checked for links as well. Theoretically, a spider would eventually visit every page on the Internet, as nearly every Web site is linked in some way to other Web sites.

In this article I will show you how to construct a spider using the C# programming language. This spider will download the entire contents of a Web site into a specified local directory. You can easily use the core classes of this project to construct your own spider projects. The complete source code for this project can be downloaded from www.sys-con.com/dotnet/sourcec. You can see this program running in Figure 1.
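The traversal described above (start from one page, collect its links, then visit each link in turn) can be sketched in a few lines. This is only an illustration and is not taken from the article's source code: the class name CrawlSketch is hypothetical, and the "Web" is assumed to be an in-memory map from each page to the links found on it, so no network access is involved.

```csharp
using System;
using System.Collections.Generic;

// Sketch of the spider's traversal. The "web" is a hypothetical in-memory
// map from a page URL to the links found on that page.
public static class CrawlSketch
{
    public static List<string> Crawl(string start, IDictionary<string, string[]> links)
    {
        var seen = new HashSet<string>();    // pages already queued
        var waiting = new Queue<string>();   // pages still to visit
        var order = new List<string>();      // visit order, kept for inspection

        seen.Add(start);
        waiting.Enqueue(start);

        while (waiting.Count > 0)
        {
            string page = waiting.Dequeue();
            order.Add(page);

            // A page with no entry in the map is treated as having no links.
            string[] found;
            if (!links.TryGetValue(page, out found))
                continue;

            foreach (string link in found)
            {
                // HashSet.Add returns false for a link that was seen before,
                // which keeps circular links from looping forever.
                if (seen.Add(link))
                    waiting.Enqueue(link);
            }
        }
        return order;
    }
}
```

The set of seen pages is what stops the spider from revisiting pages that link back to each other. A real spider would replace the dictionary lookup with a download and a parse of the page.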
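C# is particularly suited to spider programming because threading and HTTP access are both built in, which makes it relatively easy to create a spider. The major parts of an HTML spider, each covered in a section below, are parsing HTML, processing the downloaded pages, handling threads, and detecting when the spider is done.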
HTML Parsing

Using the HTML parser is very easy. You begin by creating an instance of the ParseHTML class. Next you set the Source property to the HTML document that you would like to parse.

ParseHTML parse = new ParseHTML();

You will now be able to loop through all of the text and tags that make up the HTML document. You will usually begin with a while loop that checks the Eof method.

while(!parse.Eof())

The Parse() method returns the characters that make up the HTML document. Only characters that are not part of an HTML tag are returned. When the parser reaches an HTML tag, Parse() returns zero to indicate that a tag has been found. Once a tag has been found, the GetTag() method can be used to process it.

if(ch==0)

A spider will generally be concerned with locating HREF attributes in the tag. This can be done by using the indexer feature of C#. The following line retrieves the HREF attribute, if present.

Attribute href = tag["HREF"];

Once the attribute has been retrieved, its Value property holds the value stored in the attribute.
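For readers who do not have the article's ParseHTML class at hand, the same goal (collecting HREF values from a page) can be approximated with the standard Regex and Uri classes. This is a swapped-in technique, not the author's parser: the class name HrefSketch and the pattern are assumptions made for illustration, and a regular expression will miss cases that a real HTML parser handles, such as unquoted attribute values.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper that pulls quoted HREF values out of anchor tags.
public static class HrefSketch
{
    // Matches href="..." or href='...' inside an <a ...> tag, ignoring case.
    private static readonly Regex HrefPattern = new Regex(
        "<a\\b[^>]*?\\bhref\\s*=\\s*[\"']([^\"']*)[\"']",
        RegexOptions.IgnoreCase);

    // Returns every link on the page as an absolute URL.
    public static List<string> ExtractLinks(Uri pageUri, string html)
    {
        var result = new List<string>();
        foreach (Match m in HrefPattern.Matches(html))
        {
            // Resolve relative links such as "page2.html" against the page URL.
            Uri absolute;
            if (Uri.TryCreate(pageUri, m.Groups[1].Value, out absolute))
                result.Add(absolute.AbsoluteUri);
        }
        return result;
    }
}
```

Resolving each link against the URL of the page it came from matters for a spider, because most links inside a site are relative.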
Processing the Pages
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(m_uri);

A response stream is then obtained for the request. However, before we accept any data we must determine whether the file is binary or text, because the two are handled differently. The following line determines if the file is binary.

if( !response.ContentType.ToLower().StartsWith("text/") )

If the file is not binary, it is read as text. To do this, a reader is obtained and the file is added to a buffer, line by line.

reader = new StreamReader(stream);

Once the file has been loaded it is saved as a text file.

SaveTextFile(buffer);

So far I have shown you only how the spider determines whether to save the file as binary or text data. Now I will show you how each of these two file types is stored.

Binary files have a content type that does not begin with "text/". The spider copies binary files directly to a disk file and performs no further processing on them. Binary files do not contain HTML, so there are no further links for the spider to process. The following steps are used to write the contents of a binary file. First, a buffer is prepared that will hold segments of the binary file as it is transferred.

byte []buffer = new byte[1024];

Next you must determine the path and name of the local file that you would like to open. For example, if you were downloading the site to the local directory "c:\test\" and the URL were "http://myhost.com/images/logo.gif", the output filename and path would be "c:\test\images\logo.gif". Further, you would need to ensure that the images directory is created. This is done using the convertFilename method.

string filename = convertFilename( response.ResponseUri );

The convertFilename method breaks apart the HTTP address and creates the appropriate directories. Now that you have the output path and filename you can open an input stream from the Web page and a local output stream to the output file.

Stream outStream = File.Create( filename );

You are now ready to copy the contents of the Web file to the local file. This is done with the following lines of code.

int l;

As you can see, the input stream is read in a loop and its contents are written to the local file. Once the file has been written, the streams can be closed.

outStream.Close();

Text files are easier to download. A text file is any file whose content type begins with "text/". A text file will already have been downloaded and stored in a string. This string is used to parse out the links, but it can also be written to disk. The following lines are used to save these text files.

string filename = convertFilename( m_uri );

As you can see, an output file stream is opened and the contents of the buffer are written directly to the file. The file is then closed.
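The three decisions in this section (text or binary, where to store the file, and how to copy the bytes) can be sketched together as follows. This is an assumption-laden illustration rather than the article's code: PageSaverSketch, IsText, ToLocalPath, and CopyStream are hypothetical names, ToLocalPath is only a stand-in for convertFilename and does not create directories, and the "index.html" default for URLs ending in a slash is a guess.

```csharp
using System;
using System.IO;

// Hypothetical helpers that mirror the steps described in this section.
public static class PageSaverSketch
{
    // A response is text when its content type starts with "text/".
    public static bool IsText(string contentType)
    {
        return contentType != null &&
               contentType.ToLowerInvariant().StartsWith("text/", StringComparison.Ordinal);
    }

    // Maps a URL onto a path under baseDir, one directory per URL segment.
    // Stand-in for convertFilename; unlike it, this creates nothing on disk.
    public static string ToLocalPath(string baseDir, Uri uri)
    {
        string path = baseDir;
        foreach (string part in uri.AbsolutePath.Split('/'))
        {
            if (part.Length > 0)
                path = Path.Combine(path, part);
        }
        // Assumption: a URL ending in "/" is stored as index.html.
        if (uri.AbsolutePath.EndsWith("/", StringComparison.Ordinal))
            path = Path.Combine(path, "index.html");
        return path;
    }

    // Copies input to output in 1024-byte segments; returns the byte count.
    public static long CopyStream(Stream input, Stream output)
    {
        byte[] buffer = new byte[1024];
        long total = 0;
        int read;
        while ((read = input.Read(buffer, 0, buffer.Length)) > 0)
        {
            output.Write(buffer, 0, read);
            total += read;
        }
        return total;
    }
}
```

Before opening the output file, a caller would typically run Directory.CreateDirectory(Path.GetDirectoryName(path)) so that folders such as "images" exist. Wrapping both streams in using statements is a safer alternative to calling Close by hand, since the streams are then released even if the copy throws.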
Thread Handling

Threading pays off when a program routinely waits for an external event to occur, and this is exactly what happens with the spider. A single-threaded spider must request a URL, wait for the file to download, and then request the next URL. It would be much better to request several URLs and wait for them all at once. Threading allows this, as the spider can use multiple threads to request pages simultaneously.

To do this, the DocumentWorker class is used to isolate all of the work performed for a single URL. As each DocumentWorker object is created, it enters a loop waiting for the next URL to process. The main loop of a DocumentWorker is shown here.

while(!m_spider.Quit )

You begin by entering a while loop that continues until the spider's quit flag is set to true. The quit flag is set to true when the cancel button is clicked. Next, a URL is obtained by calling the ObtainWork method, which waits until a URL is available. URLs become available as other threads parse documents and find links. The WorkerBegin and WorkerEnd methods are used by the Done class to determine when the process is finished.

You can specify how many threads the program will use. The optimal number depends on several factors. If your computer is fast, or has dual processors, a higher number of threads will work best. However, if your bandwidth is limited, a lower number of threads may process just as many URLs per second as a higher number.
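The waiting behavior of ObtainWork and the worker loop can be sketched with Monitor.Wait and Monitor.Pulse. This is a minimal sketch under stated assumptions, not the article's Spider or DocumentWorker classes: WorkQueueSketch, AddWork, Quit, and StartWorkers are hypothetical names, and here ObtainWork returns null after Quit is called so that the loop has a way to end.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

// Hypothetical work queue shared by all worker threads.
public class WorkQueueSketch
{
    private readonly Queue<string> m_waiting = new Queue<string>();
    private readonly object m_lock = new object();
    private bool m_quit;

    public void AddWork(string url)
    {
        lock (m_lock)
        {
            m_waiting.Enqueue(url);
            Monitor.Pulse(m_lock);       // wake one worker that is waiting
        }
    }

    public void Quit()
    {
        lock (m_lock)
        {
            m_quit = true;
            Monitor.PulseAll(m_lock);    // wake every worker so it can exit
        }
    }

    // Blocks until a URL is available; returns null once Quit has been called.
    public string ObtainWork()
    {
        lock (m_lock)
        {
            while (m_waiting.Count == 0 && !m_quit)
                Monitor.Wait(m_lock);
            return m_quit ? null : m_waiting.Dequeue();
        }
    }

    // Starts threadCount workers; each calls process once per URL it obtains.
    public Thread[] StartWorkers(int threadCount, Action<string> process)
    {
        var threads = new Thread[threadCount];
        for (int i = 0; i < threadCount; i++)
        {
            threads[i] = new Thread(() =>
            {
                string url;
                while ((url = ObtainWork()) != null)
                    process(url);
            });
            threads[i].IsBackground = true;
            threads[i].Start();
        }
        return threads;
    }
}
```

Because each worker spends most of its time blocked, either in ObtainWork or in a download, several workers can overlap their waits. That overlap is the gain the section describes.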
Are We There Yet?

First, I must define what exactly I mean by "done." The spider is considered done when there are no URLs waiting to be processed and all of the worker threads have finished processing. In other words, no URLs are waiting and no URLs are being processed. Once the spider has reached this state it will not receive any additional work. The Done class provides a WaitDone method that blocks until the Done object detects that the spider is done. The WaitDone method is shown here.

public void WaitDone()

The WaitDone method waits until there are no active threads. You have to be careful, though: at the very beginning of the process there are no active threads either. It would be easy for the spider to stop prematurely as it starts up, detects that no threads are active, and immediately shuts down. To solve this problem a method named WaitBegin is provided, which waits for the spider to begin. You should call WaitBegin first, followed by WaitDone, which waits for the spider to stop. The WaitBegin method is shown here.

public void WaitBegin()

The WaitBegin method waits for the m_started flag to be set. This flag is set by the WorkerBegin method. As the worker threads begin processing each URL they call WorkerBegin, and when they are finished they call WorkerEnd. These two methods allow the Done object to track progress and determine when the spider is done. The WorkerBegin method is shown here.

public void WorkerBegin()

The WorkerBegin method begins by increasing the number of active threads. The m_started flag is also set. Finally, Pulse is called to free a thread that might be waiting for a worker to begin. The method that would be waiting on the Done object in this case is WaitBegin. The WorkerEnd method is called after each URL is processed. The WorkerEnd method is shown here.

public void WorkerEnd()

The WorkerEnd method decreases the m_activeThreads counter and calls Pulse to free a thread that might be waiting on the Done object. The method that would be waiting in this case is WaitDone.
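Since only the signatures of these four methods appear above, the following sketch shows one way their bodies could look. It follows the description in the text but is not the author's implementation: the class name DoneSketch, the private lock object, and the ActiveThreads property are assumptions, and PulseAll is used in place of Pulse so that every waiting thread rechecks its condition.

```csharp
using System.Threading;

// Hypothetical reconstruction of the completion tracker described above.
public class DoneSketch
{
    private readonly object m_lock = new object();
    private int m_activeThreads;   // workers currently processing a URL
    private bool m_started;        // set once the first worker has begun

    public int ActiveThreads
    {
        get { lock (m_lock) { return m_activeThreads; } }
    }

    // Blocks until at least one worker has called WorkerBegin.
    public void WaitBegin()
    {
        lock (m_lock)
        {
            while (!m_started)
                Monitor.Wait(m_lock);
        }
    }

    // Blocks until no worker is active.
    public void WaitDone()
    {
        lock (m_lock)
        {
            while (m_activeThreads > 0)
                Monitor.Wait(m_lock);
        }
    }

    // Called by a worker just before it processes a URL.
    public void WorkerBegin()
    {
        lock (m_lock)
        {
            m_activeThreads++;
            m_started = true;
            Monitor.PulseAll(m_lock);   // release any caller of WaitBegin
        }
    }

    // Called by a worker after it has processed a URL.
    public void WorkerEnd()
    {
        lock (m_lock)
        {
            m_activeThreads--;
            Monitor.PulseAll(m_lock);   // let WaitDone recheck the counter
        }
    }
}
```

A caller would invoke WaitBegin and then WaitDone, in that order. Note that this sketch tracks only active workers. To match the definition of "done" given above, a caller would also need to confirm that no URLs are still waiting in the queue.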
Conclusion

This article gives a fundamental explanation of using a spider. The source code provided with this article will also give you a great start on your own spider projects. The source is flexible enough that it is easily adapted to other uses.