By Jeff Heaton | July 7, 2003 12:00 AM EDT
Spiders serve many purposes on the Internet. Search engines use them to locate Web pages for their databases. Companies use them to monitor their competitors' Web sites and track changes. Individual users use them to download the contents of Web pages for later offline viewing. Developers use them to scan their own Web sites for broken links and other issues. But what exactly is a spider?
Spiders are semi-autonomous programs that travel Web links just the way a real spider travels strands of its web. A spider is semi-autonomous in that it is given an initial Web page to start from, but where the spider goes is determined as it runs. The spider scans the starting page for links, and later visits those links. Each of the pages that was linked to will in turn be checked for links as well. Theoretically, a spider would eventually visit every page on the Internet, as nearly every Web site is linked in some way to other Web sites.
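The traversal described above can be sketched as a queue of pages waiting to be scanned plus a set of pages already seen. This is only an illustration, not the article's code: the class name CrawlSketch and the in-memory links dictionary (which stands in for downloading a page and extracting its links) are assumptions made for the example.

```csharp
using System;
using System.Collections.Generic;

public static class CrawlSketch
{
    // Visits every page reachable from 'start', breadth first.
    // 'links' is a hypothetical stand-in for "download the page and extract
    // its links": it maps a page to the pages that page links to.
    public static List<string> Crawl(string start, IDictionary<string, string[]> links)
    {
        var seen = new HashSet<string>();    // pages already queued, so none is scanned twice
        var waiting = new Queue<string>();   // pages found but not yet scanned
        var order = new List<string>();      // pages in the order they were visited

        seen.Add(start);
        waiting.Enqueue(start);

        while (waiting.Count > 0)
        {
            string page = waiting.Dequeue();
            order.Add(page);

            string[] found;
            if (!links.TryGetValue(page, out found))
                continue;                    // this page has no outgoing links

            foreach (string link in found)
            {
                if (seen.Add(link))          // Add returns false for a page seen before
                    waiting.Enqueue(link);
            }
        }
        return order;
    }
}
```

The set of seen pages is what keeps the spider from looping forever when two pages link to each other.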
In this article I will show you how to construct a spider using the C# programming language. This spider will download the entire contents of a Web site into a specified local directory. You can easily reuse the core classes of this project in your own spider projects. The complete source code for this project can be downloaded from www.sys-con.com/dotnet/sourcec. You can see this program running in Figure 1.
C# is particularly well suited to spider programming because the .NET Framework it runs on provides both threading and HTTP access. This makes it relatively easy to create a spider. The major steps in creating an HTML spider are as follows:
1. HTML parser: You will need some sort of HTML parser to scan the HTML documents the spider will encounter. I will show you how to construct one.
2. Page processing: You will need code that will process each page downloaded. You may choose to save the contents of the site to disk or choose some other action.
3. Threading: Spiders work most efficiently when multithreaded. You will need the infrastructure to divide the spider process among threads.
4. Determining completion: This is harder than it sounds, especially when multithreading is being used.
HTML Parsing
Neither the C# language nor the .NET Framework class library includes support for HTML parsing. XML parsing is supported; however, parsers designed for the rigid XML syntax are nearly useless for HTML, whose syntax is much more forgiving. To solve this I created an HTML parser in C#. This parser is self-contained and could easily be used in any C# application that needs HTML parsing.
Using the HTML parser is very easy. You begin by creating an instance of the ParseHTML class. Next you set the Source property to the HTML document that you would like to parse.
ParseHTML parse = new ParseHTML();
parse.Source = "<p>Hello World</p>";
You will now be able to loop through all of the text and tags that make up the HTML document. You will usually begin with a while loop that calls the Eof method.
while(!parse.Eof())
{
char ch = parse.Parse();
The Parse() method will return the characters that make up the HTML document. Only those characters that are not part of HTML tags will be returned. If you reach an HTML tag, zero will be returned to indicate that a tag has been found.
Once a tag has been found, the GetTag() method can be used to process it.
if(ch==0)
{
HTMLTag tag = parse.GetTag();
}
} // end of the while loop
A spider will generally be concerned with locating HREF attributes in the tag. This can be done with a C# indexer. The following lines retrieve the HREF attribute, if present, and read its value.
Attribute href = tag["HREF"];
string link = href.Value;
Once the attribute has been retrieved, the Value property is used to determine what value is stored in this attribute.
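The ParseHTML class ships with the downloadable source and is not reproduced here. As a rough stand-in for experimenting with link extraction, the sketch below uses a regular expression instead of a real parser, which is a cruder technique than the one the article describes: it only handles quoted attribute values and knows nothing about comments or scripts. The class name LinkSketch and the method ExtractLinks are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LinkSketch
{
    // Matches href="..." or href='...' inside a tag. The attribute name is
    // matched without regard to case, so HREF and href both work.
    private static readonly Regex Href = new Regex(
        "<[^>]*?\\bhref\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)')",
        RegexOptions.IgnoreCase);

    // Returns the HREF values found in the document, in document order.
    public static List<string> ExtractLinks(string html)
    {
        var result = new List<string>();
        foreach (Match m in Href.Matches(html))
        {
            // Only one of the two groups participates, depending on the quote style.
            string value = m.Groups[1].Success ? m.Groups[1].Value : m.Groups[2].Value;
            result.Add(value);
        }
        return result;
    }
}
```

A spider would still need to resolve each value against the URL of the page it came from, since many links are relative.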
Processing the Pages
Now I will show you how to process the HTML pages. First, the page must be downloaded. This is done by using the HttpWebRequest class provided by the .NET Framework.
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(m_uri);
response = request.GetResponse();
stream = response.GetResponseStream();
A stream is then obtained from the response. However, before we accept any data we must determine whether this file is binary or text, because the two are handled differently. The following lines determine if the file is binary.
if( !response.ContentType.ToLower().StartsWith("text/") )
{
SaveBinaryFile(response);
return null;
}
string buffer = "",line;
If the file is not a binary file, then it will be read as text. To do this, a reader is obtained and the file is added to a buffer, line by line.
reader = new StreamReader(stream);
while( (line = reader.ReadLine())!=null )
{
buffer+=line+"\r\n";
}
Once the file has been loaded it will be saved as a text file.
SaveTextFile(buffer);
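The content-type test and the line-by-line read can be pulled into two small helpers, sketched here under assumptions: the names TextPageSketch, IsText, and ReadAllLines are hypothetical and not part of the article's source. The read uses a StringBuilder, which avoids allocating a new string for every line of a large page.

```csharp
using System;
using System.IO;
using System.Text;

public static class TextPageSketch
{
    // True when the Content-Type header names a text format,
    // for example "text/html; charset=utf-8".
    public static bool IsText(string contentType)
    {
        return contentType != null &&
               contentType.Trim().StartsWith("text/", StringComparison.OrdinalIgnoreCase);
    }

    // Reads every line from the reader and terminates each with CR LF,
    // accumulating the result in a StringBuilder.
    public static string ReadAllLines(TextReader reader)
    {
        var buffer = new StringBuilder();
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            buffer.Append(line).Append("\r\n");
        }
        return buffer.ToString();
    }
}
```

Because ReadAllLines accepts any TextReader, it can be tried out with a StringReader before being pointed at a StreamReader over a network stream.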
So far I have shown you only how the spider determines whether to save the file as binary or text data. Now I will show you how each of these two file types is stored.
Binary files have a content type that does not begin with "text/". The spider copies the binary files directly to a disk file and no further processing is done on the binary files. Binary files do not contain HTML, so there are no further links for the spider to process. The following method is used to write the contents of a binary file.
First, a buffer is prepared that will hold segments of the binary file as it is transferred.
byte []buffer = new byte[1024];
Next you must determine the path and name of the local file that you would like to open. For example, if you were downloading the site to the local directory "c:\test\" and the URL were "http://myhost.com/images/logo.gif", the output filename and path would be "c:\test\images\logo.gif". Further, you would need to ensure that the images directory is created. This is done using the convertFilename method.
string filename = convertFilename( response.ResponseUri );
The convertFilename method breaks apart the HTTP address and creates the appropriate directories. Now that you have the output file and pathname you can open an input stream from the Web page and a local output stream to the output file.
Stream outStream = File.Create( filename );
Stream inStream = response.GetResponseStream();
You are now ready to copy the contents of the Web file to the local file. This is done with the following lines of code.
int l;
do
{
l = inStream.Read(buffer,0,buffer.Length);
if(l>0)
outStream.Write(buffer,0,l);
} while(l>0);
As you can see, the input stream is looped through as its contents are written to the local file. Once the file has been written, the streams can be closed.
outStream.Close();
inStream.Close();
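The same copy loop can be written as a reusable helper. In the sketch below, the names CopySketch, Copy, and Save are hypothetical; the using blocks are an addition that closes both streams even if the copy throws partway through.

```csharp
using System;
using System.IO;

public static class CopySketch
{
    // Copies input to output in 1 KB segments and returns the number of bytes copied.
    public static long Copy(Stream input, Stream output)
    {
        byte[] buffer = new byte[1024];
        long total = 0;
        int count;
        // Read returns 0 only when the end of the stream has been reached.
        while ((count = input.Read(buffer, 0, buffer.Length)) > 0)
        {
            output.Write(buffer, 0, count);
            total += count;
        }
        return total;
    }

    // Saves a stream to a file, closing both streams when finished.
    public static long Save(Stream input, string filename)
    {
        using (input)
        using (Stream output = File.Create(filename))
        {
            return Copy(input, output);
        }
    }
}
```

On .NET Framework 4 and later, Stream.CopyTo performs the same loop, so Copy is mainly useful when a byte count is wanted.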
Text files are easier to save. A text file is any file whose content type begins with "text/". By this point the text file has already been downloaded and stored in a string. This string is used to parse out the links, but it can also be written to disk. The following lines are used to save these text files.
string filename = convertFilename( m_uri );
StreamWriter outStream = new StreamWriter( filename );
outStream.Write(buffer);
outStream.Close();
As you can see, an output file stream is opened and the contents of the buffer are written directly to the file. The file is then closed.
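Both save routines rely on convertFilename, which is not listed in this article. A minimal sketch of the mapping it is described as performing might look like the following. Everything here is an assumption made for illustration: the names FilenameSketch, ToLocalPath, and PrepareLocalFile, the choice of "index.html" for URLs that end in a slash, and the decision to ignore the query string.

```csharp
using System;
using System.IO;

public static class FilenameSketch
{
    // Maps a URL onto a path under baseDir,
    // for example /images/logo.gif becomes baseDir/images/logo.gif.
    public static string ToLocalPath(Uri uri, string baseDir)
    {
        string path = uri.AbsolutePath;      // "/images/logo.gif"; the query string is ignored
        if (path.EndsWith("/", StringComparison.Ordinal))
            path += "index.html";            // assumption: file name for directory-style URLs

        string result = baseDir;
        foreach (string part in path.Split(new[] { '/' }, StringSplitOptions.RemoveEmptyEntries))
        {
            result = Path.Combine(result, part);
        }
        return result;
    }

    // Creates any missing directories, then returns the path ready to be opened.
    // Assumes baseDir is not empty, so the result always has a parent directory.
    public static string PrepareLocalFile(Uri uri, string baseDir)
    {
        string filename = ToLocalPath(uri, baseDir);
        Directory.CreateDirectory(Path.GetDirectoryName(filename));
        return filename;
    }
}
```

Keeping the path calculation separate from the directory creation makes the mapping easy to check without touching the disk.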
Thread Handling
Multithreading allows a computer to appear to be doing more than one thing at once. However, unless your computer has more than one processor, this is only an illusion; the computer is switching between threads very quickly. In general there are only two scenarios in which threads will make a program run faster. The first is when you are using a multiprocessor computer. On such a machine, a program that does not use threads cannot run any faster than it would on a single processor, because there is no way to split the task across the second processor.
The second scenario is the case where your program is routinely waiting for an external event to occur. This is exactly what is happening with the spider. The spider must request a URL, wait for the file to download, and then request the next URL. It would be much better to request several URLs and wait for them all at once. Threading allows you to do this, as the spider can use multiple threads to request pages simultaneously.
To do this, the DocumentWorker class is used to isolate all of the work performed for a single URL. Each DocumentWorker object, once started, enters a loop waiting for the next URL to process. The main loop of a DocumentWorker is shown here.
while(!m_spider.Quit )
{
m_uri = m_spider.ObtainWork();
m_spider.SpiderDone.WorkerBegin();
string page = GetPage();
if(page!=null)
ProcessPage(page);
m_spider.SpiderDone.WorkerEnd();
}
You begin by entering a while loop that will continue until the spider's quit flag is set to true. The quit flag will be set to true if the cancel button has been clicked. Next, a URL is obtained by calling the ObtainWork method. The ObtainWork method will wait until a URL is available. URLs will become available as other threads parse documents and find links. The WorkerBegin and WorkerEnd methods are used by the Done class to determine when the process is finished.
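The article does not list ObtainWork itself. One way such a blocking hand-off could work is sketched below, using Monitor.Wait and Monitor.Pulse on a private lock object. The class name WorkQueueSketch, the AddWork method, and the duplicate check are assumptions for illustration, not the article's implementation.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

public class WorkQueueSketch
{
    private readonly object m_lock = new object();
    private readonly Queue<Uri> m_waiting = new Queue<Uri>();   // URLs not yet handed out
    private readonly HashSet<Uri> m_seen = new HashSet<Uri>();  // every URL ever queued

    // Queues a URL unless it was queued before; returns true if it was added.
    public bool AddWork(Uri uri)
    {
        lock (m_lock)
        {
            if (!m_seen.Add(uri))
                return false;
            m_waiting.Enqueue(uri);
            Monitor.Pulse(m_lock);     // wake one worker blocked in ObtainWork
            return true;
        }
    }

    // Blocks until a URL is available, then removes and returns it.
    public Uri ObtainWork()
    {
        lock (m_lock)
        {
            while (m_waiting.Count == 0)
                Monitor.Wait(m_lock);  // releases the lock while waiting
            return m_waiting.Dequeue();
        }
    }
}
```

As written, a worker blocked in ObtainWork waits indefinitely if no more URLs arrive, so a real spider would also need a way to wake its workers when it quits.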
You are allowed to specify how many threads the program will use. The optimal number of threads to use will depend on a number of factors. If your computer is fast, or is a dual processor, higher numbers of threads will work best. However, if your bandwidth is limited, a lower number of threads may process just as many URLs per second as a higher number of threads.
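Starting a configurable number of workers might look like the sketch below. The names WorkerPoolSketch and Run are hypothetical, and the work delegate stands in for a DocumentWorker's main loop.

```csharp
using System;
using System.Threading;

public static class WorkerPoolSketch
{
    // Starts threadCount threads that each run 'work', then waits for all of them to finish.
    // Each thread receives its own index, from 0 to threadCount - 1.
    public static void Run(int threadCount, Action<int> work)
    {
        if (threadCount < 1)
            throw new ArgumentOutOfRangeException("threadCount");

        var threads = new Thread[threadCount];
        for (int i = 0; i < threadCount; i++)
        {
            int id = i;                       // copy, so each closure captures its own value
            threads[i] = new Thread(() => work(id));
            threads[i].IsBackground = true;   // do not keep the process alive on exit
            threads[i].Start();
        }
        foreach (Thread t in threads)
            t.Join();
    }
}
```

Making the count a parameter lets you try several values and measure URLs per second on your own connection.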
Are We There Yet?
Creating multiple threads to process the URLs will speed up the program.
However, these threads also raise other administrative issues. One of the
most challenging is determining when the spider is done. To do this I
created a simple class named Done. This class can be used to determine when
the spider is done processing.
First, I must define what exactly I mean by "done." The spider is considered to be done when there are no URLs waiting to be processed, and all of the worker threads are also done processing. This means there are no URLs waiting and no URLs being processed. The spider will not get any additional work when it has reached this state.
The Done class provides a WaitDone method you can call that will wait until the Done object detects that the spider is done. The WaitDone method is shown here.
public void WaitDone()
{
Monitor.Enter(this);
while ( m_activeThreads>0 )
{
Monitor.Wait(this);
}
Monitor.Exit(this);
}
The WaitDone method will wait until there are no active threads. You have to be careful, though; at the very beginning of the process there will not be any active threads. It would be easy for the spider to stop prematurely: it starts up, detects that no threads are active, and immediately shuts down. To solve this problem, a method named WaitBegin is provided to allow you to wait for the spider to begin. You should call WaitBegin, followed by WaitDone, which will wait for the spider to stop. The WaitBegin method is shown here.
public void WaitBegin()
{
Monitor.Enter(this);
while ( !m_started )
{
Monitor.Wait(this);
}
Monitor.Exit(this);
}
The WaitBegin method waits for the m_started flag to be set. This is set by the WorkerBegin method. As the worker threads begin processing each URL they call WorkerBegin, and when they are finished they call WorkerEnd. These two methods allow the Done object to track progress and determine when the spider is done. The WorkerBegin method is shown here.
public void WorkerBegin()
{
Monitor.Enter(this);
m_activeThreads++;
m_started = true;
Monitor.Pulse(this);
Monitor.Exit(this);
}
The WorkerBegin method begins by increasing the number of active threads. The m_started flag is also set. Finally, Pulse is called to release a thread that might be waiting for the worker to begin; that would be a thread blocked in the WaitBegin method. The WorkerEnd method is called after each URL is processed. The WorkerEnd method is shown here.
public void WorkerEnd()
{
Monitor.Enter(this);
m_activeThreads--;
Monitor.Pulse(this);
Monitor.Exit(this);
}
The WorkerEnd method decreases the m_activeThreads counter and calls Pulse to release a thread that might be waiting on the Done object; that would be a thread blocked in the WaitDone method.
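Put together, the four methods form a small class. The sketch below is a variation on the listings above rather than a copy of them, and its differences are deliberate assumptions: it uses the lock statement (which compiles to Monitor.Enter and Monitor.Exit inside a try/finally block), it locks a private object instead of this, and it calls PulseAll so that every waiting thread rechecks its condition. The class name DoneSketch and the ActiveThreads property are hypothetical.

```csharp
using System;
using System.Threading;

public class DoneSketch
{
    private readonly object m_lock = new object();
    private int m_activeThreads = 0;
    private bool m_started = false;

    // Number of workers currently between WorkerBegin and WorkerEnd.
    public int ActiveThreads
    {
        get { lock (m_lock) { return m_activeThreads; } }
    }

    // Called by a worker just before it processes a URL.
    public void WorkerBegin()
    {
        lock (m_lock)
        {
            m_activeThreads++;
            m_started = true;
            Monitor.PulseAll(m_lock);   // wake any thread blocked in WaitBegin
        }
    }

    // Called by a worker just after it finishes a URL.
    public void WorkerEnd()
    {
        lock (m_lock)
        {
            m_activeThreads--;
            Monitor.PulseAll(m_lock);   // wake any thread blocked in WaitDone
        }
    }

    // Blocks until at least one worker has begun.
    public void WaitBegin()
    {
        lock (m_lock)
        {
            while (!m_started)
                Monitor.Wait(m_lock);
        }
    }

    // Blocks until no worker is active.
    public void WaitDone()
    {
        lock (m_lock)
        {
            while (m_activeThreads > 0)
                Monitor.Wait(m_lock);
        }
    }
}
```

A caller would start the workers, then call WaitBegin followed by WaitDone, in the order the article recommends.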
Conclusion
Not all sites welcome spiders. To test the spider I used a variety of
sites. With any site you should be aware of that site's copyright and terms
of service. Most government sites are public domain and can legally be used;
however, you should check to be sure.
This article gives a basic explanation of how to build a spider. The source code provided with this article will also give you a great start on your own spider projects. The source is flexible enough that it is easily adapted to other uses.
Copyright © 2003 SYS-CON Media, Inc. — All Rights Reserved.
More Stories By Jeff Heaton
Jeff Heaton is the author of “Programming Spiders, Bots and Aggregators in Java” by Sybex. Jeff can be contacted through his website at http://www.jeffheaton.com.