Dr. Matthias Laux
September, 2003
Frequent users of the Internet face a common problem: there are
so many interesting links which they not only would like to come back to
at some point, but which are also interesting to others -- for example,
colleagues who share a corporate intranet, and who are interested in similar areas. Browser
bookmarks are suitable for storing only small numbers of such links, since they are only visible on one particular machine (unless the browser
configuration data is shared via some mechanism like NFS), and are not shareable with other users of the network.
So sooner or later, many of us create simple HTML pages with
our favorite links, typically organized by main area and subtopic, and often including a personal note in the design to make
them even more interesting to other users.
But the web is a very dynamic medium, and links
are bound to be migrated to some other location, or vanish entirely for
any number of reasons. So these web pages have to be maintained. This process,
however, is almost impossible to handle manually, since you'd have to
click on each link, see what happens, and update the HTML code manually
depending on the response received. With hundreds of links in a typical
collection, this is clearly not practical. The lack of a solution to this problem sooner or later leads to
frustrating user experiences because of many broken
links. If the web page creator is lucky, s/he may get a friendly email
from someone who detected such a broken link; in the more unlucky cases, the
emails are not friendly, or users simply do not return to that collection,
thereby making it essentially obsolete.
But there is a solution. I have developed three tools
to store link information
in a database, allowing for automated link validity checks (including a corresponding update of the database). The tools also provide a powerful
mechanism to automatically update the web pages that contain the link collections,
using the open-source Velocity template engine.
(To give credit where credit is due: I got the idea
to create this set of tools after reading John Zukowski's
article "Validating URL Links," in which he describes a technique to check whether an HTTP URL is still available.)
- URLManage manages all interactions with
the link data in the backend storage, which
is a relational database system. This RDBMS stores the data
in some tables, and either a new database needs to be set up, or an existing
database can be used (using a separate schema or instance) to hold these
tables. URLManage relies on the DBAccessor package, a JDBC wrapper that I developed earlier. URLManage creates, modifies, deletes,
and retrieves link data, and manages the database
schema itself.
- URLCheck automatically checks the links
stored in the database against the network; that is, URLCheck tries to
connect to that network resource and determine its status (availability).
The actions taken (such as updating or removing a link in the database)
depend on the status response received.
- URLPublish creates web pages using the
Velocity template engine, an Apache Jakarta
project. The link data from
the database is combined with a template web page to create the
actual HTML output required for publishing.
Combined, these tools are referred to as
the URLManager package, and it is available for
download. The rest of this article describes the three tools in great detail.
URLManage
Data Representation
The first design decision I made was to store the link data in a database to provide a clean structure for the information, and to
take advantage of all the usual benefits of an RDBMS (such as transaction or backup support).
The next step was to define the database schema. The information we
want to store for a link consists of the actual URL, description text,
and some additional information: the date this entry was created, the
date it was last checked, and the response code obtained during this check.
A link can be known in one or more contexts, where a context is basically
a topic area to which this link belongs.
Here's an example to shed more light on my approach. We have a link with the URL "http://java.sun.com" and the description "Sun's
main Java entry page" (plus the additional information described above).
In a link collection, this link could be known in the context of "Programming
Languages," but also in the context of "Sun Microsystems." Looking at
the actual web pages created, this could result in the link appearing
in two different chapters on that page, or on different pages -- all depending
on the output template(s) chosen (details on this will follow later).
This data structure is represented by two tables in the database. Their
schema is described by an XML file (schema.xml), and it
has the following properties:
Table URLMGR_LINK
| Column | Type | Length |
| URL | VARCHAR | 200 |
| DESCRIPTION | VARCHAR | 200 |
| CREATED | VARCHAR | 20 |
| LASTCHECK | VARCHAR | 20 |
| LASTCODE | INTEGER | |

Table URLMGR_CONTEXT
| Column | Type | Length |
| URL | VARCHAR | 200 |
| CONTEXT | VARCHAR | 200 |
Note that the column names shown in bold form the primary
key for the respective tables, indicating which entries are unique.
An interesting detail is that the column lengths can be modified, if
necessary, in the XML file above (which is part of the URLManager package) before you actually create the database schema.
The DBAccessor package is used internally to transform this XML schema
description into SQL statements, which are then used to create these
tables in the database using the URLManage tool. When creating or updating
data in these tables, the chosen lengths of these fields are retrieved
directly from the database (again using DBAccessor capabilities). This information is then used
to check whether or not the link and context data provided
by the user fits into the space provided in the database tables.
Note that once the field lengths have been chosen and the
schema created, these lengths cannot be modified using the URLManager
package.
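The length check described above can be illustrated with a minimal sketch. It assumes the column lengths have already been obtained (with plain JDBC, this kind of information is available from DatabaseMetaData.getColumns() in the COLUMN_SIZE column); the class FieldLengthCheck and its methods are hypothetical names for illustration and are not part of URLManager or DBAccessor.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of URLManager or DBAccessor.
final class FieldLengthCheck {

    // Maximum length per column name, e.g. URL -> 200.
    private final Map<String, Integer> maxLengths = new LinkedHashMap<>();

    // Registers the maximum length of a VARCHAR column. In the real tool
    // these values are read from the database rather than hard-coded.
    FieldLengthCheck column(String name, int maxLength) {
        maxLengths.put(name, maxLength);
        return this;
    }

    // Returns true if the value is non-null and fits into the named column.
    boolean fits(String column, String value) {
        Integer max = maxLengths.get(column);
        if (max == null) {
            throw new IllegalArgumentException("Unknown column: " + column);
        }
        return value != null && value.length() <= max;
    }
}
```

With the default schema, a check would be set up with column("URL", 200) and column("DESCRIPTION", 200), and fits() would be called before any insert or update is attempted.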
Links and Contexts
The two core data components -- links and contexts -- are modeled with the
classes Link and Context, both of which are
subclasses of DBAccessor's RowData class. This class provides
built-in persistence support via the insert(), update(),
select(), and delete() methods, which automatically
create and issue the corresponding SQL statements to perform these actions
in the database.
The Context class is actually very simple, holding just two
string-valued properties -- for the URL and the name of the context. To
create this class, only a few convenience extensions to the services
provided by RowData were implemented; all other required
capabilities (such as persistence or setter/getter methods) are provided
by RowData. The Link class is slightly more
complex, since it holds references to all contexts defined for a link
in a hash map. For example, to insert a link into the database, the insert()
method of the Link class would first call its own insert()
method (in the superclass), and then call insert() for all
of its contexts:
public class Link extends ml.jdbc.RowData {
    // All contexts defined for this link
    java.util.HashMap contexts = new java.util.HashMap();
    ...
    public void insert(ml.jdbc.DBUser user) throws ml.jdbc.AccessorException
    {
        // Insert the link row itself, then all of its context rows
        super.insert(user);
        ml.jdbc.Transfer.exportTableData(contexts.values(), user);
    }
    ...
}
DBAccessor offers another convenience bulk method here (exportTableData()),
which performs the insert step for the contexts under the covers.
Using URLManage
URLManage is a simple Java application whose main task is to first validate
the command line options provided by the user, and then issue these
commands to the database. This database interaction is handled by the PersistenceManager
class, an instance of which is created within URLManage. PersistenceManager
offers all the capabilities to manage links and contexts (creation,
deletion, updates, and selections of individual entries or lists of entries).
Other capabilities include bulk import of data from input files (which is useful
for the import of existing link collections) and database schema management
(creation and deletion of the database tables used to hold links
and contexts). Since PersistenceManager
provides all the persistence capabilities required, it could also be instantiated
and used by other tools, such as a graphical front end (rich client or web
UI) instead of the URLManage command line tool.
Before you use the URLManager
package, an important first step is to create the database schema required to hold the link information.
URLManage also supports schema management, and the database tables can
be created (or deleted, should this ever be necessary) using these commands:
java -classpath $CP ml.urlmgr.URLManage -init db.properties
java -classpath $CP ml.urlmgr.URLManage -drop db.properties
The property file db.properties contains the description of the database connection
parameters, as required by the DBAccessor package. This is a Java property file, that is, a text file with key/value
pairs, where key and value are separated by an equals sign. A typical file
might look like this:
HOST = host.company.com # Database host
PORT = 3306 # JDBC port
TYPE = msc # DB type (see DBAccessor API docs for details
# - these are included with the URLManager download)
NAME = urlmgr # The name/id of the database
USER = myuser # DB user holding the URLManager tables
PASS = passcode # Password for this user
This property file is passed on to
DBAccessor to set up the database connectivity, and no further data is
required, as long as the database is one of the types currently supported
by DBAccessor's default configuration. In the example above, msc indicates
a MySQL database. Other supported types are Oracle, DB2, Cloudscape, and
PostgreSQL. Naturally, DBAccessor also offers additional capabilities to
configure support for other database types.
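One detail is worth keeping in mind with files like the one above: the standard java.util.Properties class treats # as a comment marker only at the start of a line, so a trailing "# Database host" would become part of the value. How DBAccessor reads the file is not described here, so the following is only a sketch of a loader that strips such trailing comments; the class name ConnectionProperties is hypothetical.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.Properties;

// Hypothetical loader; DBAccessor's own parsing may differ.
final class ConnectionProperties {

    // Loads key = value pairs and strips trailing "# ..." comments, which
    // java.util.Properties would otherwise keep as part of the value.
    // Caveat: a value that legitimately contains '#' (for example a
    // password) would be cut off by this simple approach.
    static Properties load(Reader in) throws IOException {
        Properties raw = new Properties();
        raw.load(in);
        Properties cleaned = new Properties();
        for (String key : raw.stringPropertyNames()) {
            String value = raw.getProperty(key);
            int hash = value.indexOf('#');
            if (hash >= 0) {
                value = value.substring(0, hash);
            }
            cleaned.setProperty(key, value.trim());
        }
        return cleaned;
    }

    // Convenience variant for text that is already in memory.
    static Properties parse(String text) {
        try {
            return load(new StringReader(text));
        } catch (IOException e) {
            // Not expected when reading from a StringReader.
            throw new IllegalStateException(e);
        }
    }
}
```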
Once the database schema has been established, a
typical usage example of URLManage would look like this:
java -classpath $CP ml.urlmgr.URLManage -v -c db.properties "http://java.sun.com" "Sun's main Java entry page" "Programming Languages"
Here, the (optional) flag -v enables
verbose output, and -c selects link creation. The string parameters
are pretty much self-explanatory and correspond to the example described
above: the first argument is the actual HTTP
URL, the second argument is the description for this URL, and the third argument
is the context in which this URL is to be known.
If you wanted to add the additional context "Sun
Microsystems" for this link, the command would be:
java -classpath $CP ml.urlmgr.URLManage -ac db.properties "http://java.sun.com" "Sun Microsystems"
When migrating link collections to
the URLManager package, the facilities provided by URLManage for bulk imports
come in handy:
java -classpath $CP ml.urlmgr.URLManage -bc db.properties data.link
This command would import all the link data provided in the text file data.link.
This is much more efficient than importing many links using separate URLManage
invocations, since only one Java VM process needs to be created; it can
insert this data into the database using the same JDBC connection, as opposed
to creating a new VM process and database connection for each link.
Here is a typical input file for bulk link data creation:
http://www.sun.com
Sun Microsystems home page
Computer companies
http://www.sap.com
SAP AG home page
Computer companies
http://www.google.com
Google - a cool search engine for the WWW
Search Engines
...
The three string parameters required to create a link in the database
are provided on a separate line each: URL, description, and context.
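To make the file format concrete, here is a minimal sketch of how such a file could be parsed. It is not the code used by URLManage; the class BulkLinkReader and its nested Entry type are hypothetical, and the handling of blank lines is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical parser for the three-lines-per-link bulk format.
final class BulkLinkReader {

    // One link entry: URL, description, and context.
    static final class Entry {
        final String url;
        final String description;
        final String context;

        Entry(String url, String description, String context) {
            this.url = url;
            this.description = description;
            this.context = context;
        }
    }

    // Groups the non-blank lines into entries of three lines each.
    static List<Entry> parse(List<String> lines) {
        List<String> content = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            // Assumption: blank lines carry no data and are skipped.
            if (!trimmed.isEmpty()) {
                content.add(trimmed);
            }
        }
        if (content.size() % 3 != 0) {
            throw new IllegalArgumentException(
                "Incomplete entry: expected URL, description, and context lines");
        }
        List<Entry> entries = new ArrayList<>();
        for (int i = 0; i < content.size(); i += 3) {
            entries.add(new Entry(content.get(i), content.get(i + 1), content.get(i + 2)));
        }
        return entries;
    }
}
```

The input lines could come, for example, from java.nio.file.Files.readAllLines() applied to the data.link file.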
Since link collections typically refer to some links in different contexts,
URLManage also provides a bulk import method for additional contexts:
java -classpath $CP ml.urlmgr.URLManage -bac db.properties data.context
An example bulk context data file might look like this:
http://www.sun.com
Leading UNIX Vendors
http://www.sap.com
ERP Vendors
...
The format is quite simple, expecting one line for the URL (which
is unique in the database), followed by one line for the additional context
for this URL.
You can create the text files for bulk data import from existing link collections using tools like Perl or the Java regular
expression package (available as of JDK 1.4), and then import them into the database
used by the URLManager package.
You can obtain the complete usage description for the tool by invoking
it without (or with an illegal number of) arguments. The description
is also contained in the file doc/Manage.usage, which is
part of the distribution.
Now that we have stored all the required data in the database, the next step
is to provide a tool to check all of these links against the network and
to take appropriate action, depending on the outcome of this check. This
is what URLCheck was developed for.
The CheckManager Concept
A CheckManager is a class that uses a specific protocol (HTTP, HTTPS, FTP, LDAP, IMAP, and so on) to check whether the resource identified by a link is still available
on the network. The methods required by every CheckManager are:
public abstract CheckResult check(Link link) throws URLManagerException;
public abstract boolean update(Link link, CheckResult result,
PersistenceManager persistence) throws URLManagerException;
public abstract void init(java.util.Properties config, boolean verbose,
boolean update) throws URLManagerException;
Besides the check() method, the init() method
is required to transfer configuration data to an actual instance, whereas
the update() method implements the updates in the database,
depending on the result of a check. It is assumed here that the different
protocols (such as HTTP or FTP) return an integer-valued response code, which
is encapsulated in a CheckResult instance here. This helper
class holds the response code, but can also hold any number of additional
properties (via a generic mechanism based on a java.util.Properties
member variable), as required by the specific protocol checked. For HTTP,
as an example, CheckResult also holds the value of the Location
header in the HTTP response, which is required to properly
handle HTTP redirect responses.
All of this can vary, depending on the protocol to be checked. And (even
though, admittedly, most of the links encountered in link collections probably
use either HTTP or HTTPS) URLCheck was designed to allow for the inclusion
of any protocol, provided a CheckManager is implemented for
it. You can implement and add additional CheckManager subclasses to
the URLCheck tool without any code changes, just by specifying a corresponding
property file which contains all the required parameters for such a protocol,
especially the class name which implements this protocol's CheckManager.
For HTTP, this file could look like:
MANAGER = ml.urlmgr.HttpCheckManager
HTTP_PROXY = webcache.germany.sun.com
HTTP_PORT = 8080
UPDATE = 301|303
REMOVE = 404|410|500|505
(These properties are passed to the init() method
as its first argument.)
MANAGER is the only mandatory property for all protocol properties files. This
property specifies the class name implementing the CheckManager
for the protocol. All other configuration parameters specified in these
property files are completely dependent on the CheckManager
implementation used for a specific protocol. In the example above,
other configuration parameters for the HTTP CheckManager include the proxy
configuration (if required) and (optionally) lists of HTTP response codes
for which the link data needs to be updated in the database
due to redirections (UPDATE) or removals (REMOVE) -- for example, due to the
much-dreaded HTTP 404 response ("Not found"). These response codes are
specified in RFC 2616, and the
approach chosen here allows for a very flexible handling of update/remove actions, depending
on the user's requirements for the different response codes.
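How the UPDATE and REMOVE lists might be turned into a decision per response code can be sketched as follows. This is an illustration only, not the HttpCheckManager code: the class ResponsePolicy, its Action enum, and the choice to let REMOVE take precedence are assumptions; only the property names and the "|" separator come from the example file above.

```java
import java.util.HashSet;
import java.util.Properties;
import java.util.Set;

// Hypothetical policy class; names are not taken from the URLManager sources.
final class ResponsePolicy {

    enum Action { KEEP, UPDATE, REMOVE }

    private final Set<Integer> updateCodes;
    private final Set<Integer> removeCodes;

    // Reads the optional UPDATE and REMOVE properties; missing lists are empty.
    ResponsePolicy(Properties config) {
        updateCodes = parseCodes(config.getProperty("UPDATE", ""));
        removeCodes = parseCodes(config.getProperty("REMOVE", ""));
    }

    // Splits a list such as "404|410|500|505" into integer codes.
    static Set<Integer> parseCodes(String list) {
        Set<Integer> codes = new HashSet<>();
        for (String token : list.split("\\|")) {
            String trimmed = token.trim();
            if (!trimmed.isEmpty()) {
                codes.add(Integer.valueOf(trimmed));
            }
        }
        return codes;
    }

    // Assumption: REMOVE wins if a code is listed in both properties.
    Action actionFor(int responseCode) {
        if (removeCodes.contains(responseCode)) {
            return Action.REMOVE;
        }
        if (updateCodes.contains(responseCode)) {
            return Action.UPDATE;
        }
        return Action.KEEP;
    }
}
```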
You define the different CheckManagers which are to be used within URLCheck
(and thus the different supported protocols to check for)
by using a property file specified as a command line argument to URLCheck.
An example for such a file would be:
http = config/http.properties
which basically means that the properties for the HTTP protocol
are specified in the given property file. Any number of protocols can
be handled in this way, where the names of the protocols are those returned
by java.net.URL.getProtocol().
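The lookup from a link to its protocol-specific property file can be sketched like this. The class ProtocolLookup is a hypothetical name, and returning null for unparseable or unconfigured links is an assumption made for the sketch; the protocol name itself is obtained with java.net.URL.getProtocol(), as described above.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.util.Properties;

// Hypothetical helper showing the protocol-to-property-file lookup.
final class ProtocolLookup {

    // Returns the property file configured for the link's protocol, or null
    // if the link cannot be parsed or no entry exists for its protocol.
    static String configFileFor(String link, Properties master) {
        try {
            String protocol = URI.create(link).toURL().getProtocol();
            return master.getProperty(protocol);
        } catch (MalformedURLException | IllegalArgumentException e) {
            // Unknown protocol handler or malformed link text.
            return null;
        }
    }
}
```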
CheckManager also provides some basic services deemed useful for all
subclasses, the most important of which is the management of protocol
response codes. As mentioned before, all protocols are expected to return
some integer-valued response code. These are typically specified in RFCs
such as
| Protocol | RFC |
| HTTP/HTTPS | 2616 |
| LDAP | 2251 |
| FTP | 640 |
| IMAP | 2060 |
and CheckManager offers the following methods to support
generic response code management:
public void addCode(int code, String description)
public int getMinCode()
public int getMaxCode()
public String getCodeText(int code)
The idea here is that CheckManager subclasses
define the response codes for the protocol they handle in their init()
method using these methods.
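As an illustration of what these four methods could do internally, here is a minimal sketch backed by a sorted map. The class name CodeTable is hypothetical, and the behavior for an empty table or an unknown code is an assumption, since the article does not specify it.

```java
import java.util.TreeMap;

// Hypothetical stand-in for the response code management in CheckManager.
final class CodeTable {

    // Sorted by code, so the smallest and largest codes are cheap to find.
    private final TreeMap<Integer, String> codes = new TreeMap<>();

    public void addCode(int code, String description) {
        codes.put(code, description);
    }

    // Assumption: throws NoSuchElementException if no code was added yet.
    public int getMinCode() {
        return codes.firstKey();
    }

    public int getMaxCode() {
        return codes.lastKey();
    }

    // Assumption: returns null for a code that was never registered.
    public String getCodeText(int code) {
        return codes.get(code);
    }
}
```

An init() implementation for HTTP would then register entries such as addCode(404, "Not Found"), taken from the relevant RFC.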
Using URLCheck
The following code invokes a complete check of links using URLCheck:
java -classpath $CP ml.urlmgr.URLCheck -u -v db.properties web.properties
Again, -v (optional) enables verbose output. The optional
flag -u causes the specified update/remove actions to be
actually executed (without this flag, the links would only be checked,
but no changes would be made to the database). The database properties
file is again required to access the database, and the web properties file
is the master properties file described above, which contains references
to the properties files for the individual supported protocols.
To obtain a complete usage description, you invoke URLCheck
without (or with the wrong number of) arguments. URLCheck uses the following operation sequence:
- First, URLCheck reads the web properties file and instantiates a
CheckManager instance for each protocol specified, then calls the
init() method for these instances with the protocol-specific properties
as an argument. These CheckManager instances are stored in another helper
class, ProtocolHandler. This class also holds integer counters which are
used to collect statistics on the responses received for a protocol during
the checks; statistics are printed after all links have been checked.
- Next, a PersistenceManager is created using the database properties file
specified on the command line. This instance is then used to retrieve all
the links from the database.
- All these links are processed in a loop, and the CheckManager.check()
method is called for each link, provided that such an instance exists for
the link's protocol (if not, a warning message is printed). Statistics are
collected based on the CheckResult object received using the
ProtocolHandler's count() method.
- The link columns containing the date of the last check and the response
code obtained during that check are updated.
- The CheckManager's update() method is called. It is now that instance's
responsibility to effect any changes in the database required, depending on
the settings provided in the properties file for this protocol.
- After all links have been checked, statistics are printed for each
protocol, using ProtocolHandler.printStatistics().
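The core of this sequence, the loop over all links with per-protocol dispatch and statistics, can be sketched in a simplified form. This is not the URLCheck source: the class CheckLoop is hypothetical, the checkers are plain functions standing in for CheckManager.check(), and the database steps (updating the last-check columns and calling update()) are left out.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.ToIntFunction;

// Hypothetical, simplified stand-in for the loop inside URLCheck.
final class CheckLoop {

    // Checks every link with the checker registered for its protocol and
    // returns, per protocol, how often each response code was seen.
    static Map<String, Map<Integer, Integer>> run(
            List<String> links, Map<String, ToIntFunction<String>> checkers) {
        Map<String, Map<Integer, Integer>> stats = new TreeMap<>();
        for (String link : links) {
            int colon = link.indexOf(':');
            String protocol = colon > 0 ? link.substring(0, colon) : "";
            ToIntFunction<String> checker = checkers.get(protocol);
            if (checker == null) {
                // No manager configured for this protocol: warn and skip.
                System.err.println("Warning: no check manager for " + link);
                continue;
            }
            int code = checker.applyAsInt(link);
            stats.computeIfAbsent(protocol, p -> new TreeMap<>())
                 .merge(code, 1, Integer::sum);
        }
        return stats;
    }
}
```

Keeping the counters in a map per protocol mirrors the idea of ProtocolHandler holding the statistics until all links have been checked.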
Developing other CheckManagers
Currently, the only CheckManager actually implemented is
based on the HTTP protocol to cover the most important case. Additional
CheckManagers are fairly easy to implement, since they can extend the abstract
CheckManager class and use its services. The following issues
need to be addressed before you develop a new implementation class:
- First, determine how a specific protocol can actually be accessed
on the network from within the Java code. Although the Java class library is very
large, it doesn't contain handler classes to support all the relevant web
protocols; you'll need to find (or create) a Java
package that provides the required services for a protocol.
- Identify and define the parameters that are required to configure
the handling of the protocol. These need to be defined in the protocol-specific
properties file, and will be provided through the java.util.Properties
argument to the init() method.
- Create the actual implementation class by extending CheckManager
and providing the init(), update(), and check()
methods. The HttpCheckManager class can serve as an example
of how this can be done.
- Add a property to the web properties file to enable the handling
of the protocol and to identify the protocol-specific properties that the
implementation class needs to perform its tasks.
One additional complication when working with non-HTTP protocols is
the use of proxy servers -- for example, when accessing the WWW from within
corporate networks. Since both the client's browser and the proxy server
forwarding the request use HTTP as their protocol, the HTTP response codes
are visible on the client side. Other protocols, however, will be wrapped
within an HTTP request in between the client and the proxy, and only the
proxy will then use the actually chosen (non-HTTP) protocol to access the
network resource. One such example is the FTP protocol: an FTP request
of the form ftp://server.acme.com would be sent to the proxy as an HTTP
GET request for the URL ftp://server.acme.com. The proxy will then access
that FTP server directly by connecting to port 21 (the default FTP port).
The response back to the client browser will again be wrapped into a message
transferred by HTTP. The problem here is to identify the actual response
code of the FTP server, since this will also be wrapped within an HTML message
transferred by HTTP. This is something a CheckManager implementation
for FTP would need to address.
The Velocity Template Engine
The final step in the process of keeping link collections up-to-date
is to recreate the web pages based on the data that has been checked
(and possibly updated) by URLCheck. The Java-based Velocity template
engine is a very convenient tool to achieve this goal, with only a few lines of additional
code required.
The approach Velocity uses is very simple, yet powerful:
- A text file ("template") is instrumented with tags that Velocity
recognizes. This template is then processed within a Java application using
the Velocity API, and Velocity parses these tags and fills them with data
obtained from the Java application, where required. From within Velocity
tags, Java objects and methods can be accessed directly using reference
names. In addition, these tags provide some capabilities available in other programming
languages -- for example, flow-control structures. Velocity also allows for
the definition of macros, which is very useful to avoid repeating the same
tag structures several times in a template.
- To establish the connection between the actual data required to fill the template
and the template itself, we use the VelocityContext
class. Using this context, Java objects are assigned to the reference names
used in the template. Velocity provides very powerful capabilities (based
on Java Reflection) to figure out what capabilities such a Java object has.
This covers, for example, automatically identifying property getter methods
for JavaBeans, or providing iteration capabilities for objects based on
the Java Collections API with a very simple syntax.
- While Velocity can be used to create any kind of text file -- producing,
for example, SQL, PostScript, or even Java output -- in the case of the URLManager,
the primary focus is HTML.
One additional benefit of Velocity is that it directly supports the
MVC approach by letting the web designer focus on the View (the template),
and the application programmer focus on the Model (in our case, the URLManage
tool and the database) and the Controller (URLPublish).
Here is an example of a Velocity template that would create a web page
with all the links in the database, grouped by context:
#macro( list $context )
<p><b>$context</b> </p>
<ul>
#foreach( $link in $links.get($context) )
<li> <a href="$link.url"> $link.Description </a> </li>
#end
</ul> <p>
#end
<html>
<body>
#foreach( $context in $contexts )
#list($context)
#end
</body>
</html>
Here, $links and $contexts are names
which are linked with Java objects using the VelocityContext
in URLPublish:
PublishManager manager = new PublishManager(...);
VelocityContext context = new VelocityContext();
context.put("links", manager.getMap());
context.put("contexts", manager.getContexts());
$links references a java.util.HashMap that maps context
names to instances of java.util.TreeSet. Each TreeSet
instance holds all the Link
objects for that context. $contexts represents a java.util.TreeSet
instance holding just context names. Note that TreeSets
provide sorting capabilities: while the natural sort order is used for
the context names, a custom comparator has been implemented to sort links
according to various criteria. Currently, you can sort by creation date and by link description, selected via URLPublish command line flags.
In the example above, the list macro takes the name of
a context as argument ($context) and uses it to first print
a header line with the context name, and then create an HTML list with
all the links available for this context. Velocity automatically determines
that $links.get($context) is a java.util.Collection
and iterates over it. You can access the link data using an abbreviated
syntax ($link.url) which Velocity translates to a call to
the Link.getUrl() method. Note that -- apart from the Velocity
tags which will be replaced by plain HTML code -- this template is a simple
HTML page, and thus all the fancy layout techniques required to make a page visually
attractive can be employed in the usual way by a web designer, if necessary.
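The data structures behind $links and $contexts can be sketched in plain Java. This is an illustration under assumptions, not the PublishManager source: the classes LinkIndex and Item are hypothetical, the creation date is assumed to be stored as a string that sorts chronologically, and the URL is used as a tie-breaker so that two links with the same description are both kept in the TreeSet.

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical sketch of the map and sets handed to the VelocityContext.
final class LinkIndex {

    // Minimal link data needed for sorting and output.
    static final class Item {
        final String url;
        final String description;
        final String created; // assumption: a sortable text form such as 2003-09-01

        Item(String url, String description, String created) {
            this.url = url;
            this.description = description;
            this.created = created;
        }
    }

    // Sorts by description (default) or creation date, optionally reversed.
    static Comparator<Item> comparator(boolean byDate, boolean reverse) {
        Comparator<Item> primary = byDate
            ? Comparator.comparing((Item i) -> i.created)
            : Comparator.comparing((Item i) -> i.description);
        Comparator<Item> full = primary.thenComparing((Item i) -> i.url);
        return reverse ? full.reversed() : full;
    }

    private final Map<String, TreeSet<Item>> byContext = new HashMap<>();
    private final Comparator<Item> order;

    LinkIndex(Comparator<Item> order) {
        this.order = order;
    }

    // Adds a link under the given context, creating the sorted set on demand.
    void add(String context, Item item) {
        byContext.computeIfAbsent(context, c -> new TreeSet<>(order)).add(item);
    }

    // Corresponds to $links: context name -> sorted links.
    Map<String, TreeSet<Item>> getMap() {
        return byContext;
    }

    // Corresponds to $contexts: context names in natural order.
    TreeSet<String> getContexts() {
        return new TreeSet<>(byContext.keySet());
    }
}
```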
Running URLPublish with the template described above and some simple
test data results in this output web page:
<html>
<body>
<p><b>Companies</b> </p>
<ul>
<li> <a href="http://www.sun.com"> Sun Microsystems Inc. </a> </li>
</ul> <p>
<p><b>Programming Languages</b> </p>
<ul>
<li> <a href="/j2se"> J2SE home page </a> </li>
<li> <a href="/index.jsp"> Sun's main Java page </a> </li>
</ul> <p>
<p><b>Sun Microsystems</b> </p>
<ul>
<li> <a href="/index.jsp"> Sun's main Java page </a> </li>
</ul> <p>
</body>
</html>
Using URLPublish
You invoke URLPublish with this command:
java -classpath $CP ml.urlmgr.URLPublish -d -r -v db.properties web.vm page.html
The optional -v flag enables verbose output, whereas -d
enables sorting of the links by creation date (the default is to sort links
by their description text). The optional -r flag enables reverse
sorting (that is, it toggles between ascending and descending sort order). The
database properties file is again required to access the database, and
the Velocity template file to use is the second argument (here: web.vm). The output file
to create (here: page.html) completes the set of arguments.
As with the other tools, you can obtain a complete usage description by invoking URLPublish
without (or with the wrong number of) arguments.
Internally, URLPublish uses a PublishManager helper class
to assemble the data structures holding the link and context data with
the input taken from the database. These data structures are then merged
into the template through the VelocityContext.
It's simple to create several web pages for different topic areas: the contexts to be included on a web page are selected through
the template, so you create several templates, each
containing only the contexts deemed suitable to the topic covered by the
individual web page. The command described above can then be used for each
template (with a different output file, of course) to create the set of
HTML pages. The appropriate data to include in the pages is automatically
selected by Velocity. An alternative approach would be to store the different
data sets in separate database schemata and use a simple default template
for all of them.
Future Directions
The set of tools contained in the URLManager bundle is fairly complete
for handling the tasks it was designed for. One really nice-to-have feature
would be a web GUI component to control the tools from within a browser;
for example, the user could enter new link data using a standard HTML form.
Such a web user interface could be designed using one of the popular frameworks
like Struts or JavaServer Faces (JSF), and, in fact, the class structure of the
URLManager bundle has been designed with use by other applications
in mind. It should be straightforward to implement such a solution.
Other than that, CheckManagers for HTTPS and
FTP would also be nice to have.
About the Author
Dr. Matthias Laux is a senior
engineer working in the Global SAP-Sun Competence Center in Walldorf, Germany.
His main interests are Java and J2EE technology, architecture, and programming,
web services and XML technology in general, databases, and performance
and benchmarking. Although he also has a background in aerospace engineering
and HPC/parallel programming, today his languages of choice are Java
and Perl.
See Also
Download the URLManager package.
The Velocity Template Engine. An Apache Jakarta Project.
RFC 640: Revised FTP Reply Codes.
RFC 2060: Internet Message Access Protocol.
RFC 2251: Lightweight Directory Access Protocol (v3).
RFC 2616: Hypertext Transfer Protocol.
DBAccessor - A JDBC Wrapper Package by Dr. Matthias Laux.
Validating URL Links by John Zukowski.
Java, J2EE, J2SE, J2ME, and all Java-based marks are trademarks or registered trademarks of Sun Microsystems, Inc. in the United States and other countries.