I wonder just how much of the billions of bytes of storage in service today contains useful data. Not much, I suspect. Even worse, though: how much useful data is there that no one knows about?
In the early days the data related to formal business transactions, and its volume was relatively small by today’s standards. Even then the data was so tightly bound to specific data processing applications that it was very difficult to access it from any other application, particularly for ad-hoc user queries. But today we have far more complex data processing (ERP) systems, with far more data and huge amounts of historic data stored in data warehouses.
But it doesn’t stop there. Everybody seems to think it essential to write notes and letters with word processors, and with PCs capable of storing gigabytes of data, these notes accumulate. Even though most of them are simple textual letters, they are encoded in graphical formats in order to look pretty, which means that instead of a few hundred bytes of data, each note runs to kilobytes. To this situation we have to add endless spreadsheets and graphical presentations.
Now it gets worse. The Internet is unavoidable, and so HTML pages are being added to the mass of WP documents. There is a big difference, however: most WP documents are trivial and useless, while most HTML pages relate directly to the business. They contain data that is exposed to the public and as such represent the company in the eyes of that public. At the moment this is not too big a problem, because the majority of Web pages are so badly designed that errors in the data go unnoticed. Most Web sites today show the company in a bad light, but with experience they will get better. When that happens the accuracy of the content of the pages will become critical. Now, then, is the time to pay attention to the content of Web pages.
The early Web systems were simple query-response applications, allowing a user to browse through a set of pre-prepared "brochure pages". It did not take long, however, before keeping the pages up to date became a problem. Techniques therefore matured which allowed scripts to be added to Web servers to extract data from other systems. Often a company had to produce the same information in multiple formats, WP, PDF and HTML in particular. Sometimes the same data had to be translated and presented in multiple languages. This has proved to be a horrible task, and it is all too common to find inconsistent information. It is difficult enough keeping the core data accurate, with all systems updated in step, and the new requirements are adding to the problem.
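As a rough illustration of the server-side scripting idea, the sketch below builds a price-list page from a database each time it is asked for, instead of relying on a hand-edited brochure page. Everything in it is an assumption made for the example: the `products` table, its columns, the sample rows and the `render_price_list` function are hypothetical, and an in-memory SQLite database stands in for whatever "other system" really holds the data.

```python
import sqlite3
from html import escape
from string import Template

# Stand-in for the "other system": an in-memory table (hypothetical schema).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE products (name TEXT, price_pence INTEGER)")
db.executemany(
    "INSERT INTO products VALUES (?, ?)",
    [("Widget", 1250), ("Gadget & case", 3999)],
)

# One table row of the page; the data is filled in at request time.
ROW = Template("<tr><td>$name</td><td>$price</td></tr>")


def render_price_list(conn):
    """Build the HTML table from whatever the database holds right now."""
    rows = conn.execute("SELECT name, price_pence FROM products ORDER BY name")
    body = "\n".join(
        ROW.substitute(name=escape(name), price=f"{pence / 100:.2f}")
        for name, pence in rows
    )
    return f"<table>\n{body}\n</table>"


page = render_price_list(db)
print(page)
```

Because the page is rebuilt from the source data on every call, a price change in the database shows up in the next rendering without anyone editing HTML by hand. It does nothing, of course, for the WP and PDF copies of the same information.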
It has now become important to focus some attention on the problems of managing the content of all systems, and Web systems in particular. This is a massive task and few companies are doing enough about it. There will be a lot of isolated efforts in the first instance, and we will have to try to integrate them at a later stage. For instance, a lot of work has already been done with transaction data because of the growth of data warehousing. Data in the various systems was defined in a variety of dictionaries, serving a mixture of CASE tools, databases and operational systems. Somehow these have had to be coordinated in order to get clean, consistent data into the warehouse. It has been a hard task, and I wouldn’t claim that there are any wonderful solutions to the problem even today. With documents and Web pages there is a potential solution, but it is a revolutionary one: convert all the core information into XML format. If that is done, then HTML pages, PDF and word processor files, as well as data records, can all be generated from one input source. This means accepting that the word processor is obsolete and that XML editors will replace it; this will happen eventually, but it will take a long time.
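To make the single-source idea concrete, here is a minimal sketch in which one small XML document is turned into an HTML fragment, a plain-text version and a data record. The `notice`, `title` and `para` element names, the sample content and the three functions are invented for the illustration; a real system would more likely use an agreed schema and a stylesheet language such as XSLT, and PDF output is left out because it needs a separate formatting tool.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical single-source document; the element names are illustrative only.
SOURCE = """<notice lang="en">
  <title>Office closure</title>
  <para>The London office is closed on 1 May.</para>
  <para>Orders &amp; enquiries resume on 2 May.</para>
</notice>"""


def to_html(root):
    """Render the notice as an HTML fragment."""
    parts = [f"<h1>{escape(root.findtext('title', default=''))}</h1>"]
    parts += [f"<p>{escape(p.text or '')}</p>" for p in root.findall("para")]
    return "\n".join(parts)


def to_text(root):
    """Render the same notice as plain text with an underlined title."""
    title = root.findtext("title", default="")
    lines = [title, "=" * len(title)]
    lines += [p.text or "" for p in root.findall("para")]
    return "\n".join(lines)


def to_record(root):
    """Extract a flat data record, e.g. for loading into a database."""
    return {
        "title": root.findtext("title", default=""),
        "lang": root.get("lang"),
        "paragraphs": len(root.findall("para")),
    }


root = ET.fromstring(SOURCE)
html_page = to_html(root)
text_page = to_text(root)
record = to_record(root)
print(html_page)
print(text_page)
print(record)
```

The point of the sketch is that a correction made once in the XML source appears in every output the next time they are generated, so the versions cannot drift apart.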
There are a few products on the market, developed from e-commerce application suites, that are targeted at Web content management and integrate with ERP systems and data warehouses. They are better than nothing, particularly for any company committed to selling to the public over the Web. It will probably be even more important to keep content accurate as B2B applications progress. Tools of this kind are essential for automating the synchronisation of the underlying data with what the customer sees. There is nothing to be done about all that WP rubbish; forget it and wait for XML editors!