23 February 2021

Working with trademark XML data (XML shredding)

The previous post began by assuming that you have downloaded one of CIPO’s global or weekly .zip files and unzipped it into a separate folder, yielding ten to twenty thousand .xml files.  So now you’re set to do some sophisticated data analysis, right?

Well, no.  Apart from looking at the .xml files one by one as explained in the previous post, or doing simple text searches on groups of—or all of—the .xml files, you can’t really do much with the files by themselves.

To do anything interesting, you need to extract the tag-encapsulated information from the .xml files and put it into some sort of well-organized database.

If you do that, you’ll be able to do much more interesting things with the data.  Click any of the tabs along the top of this page to see some examples.

“Shredding” is the process of extracting information from an XML file, optionally transforming that information in some way, and loading it into a database.  An XML shredder is a computer program that performs the shredding process.

[Image: a portion of the .xml file, seen in Notepad++]
For example, here is the same small portion of CIPO’s .xml file for application serial no. 2084791 DISPERSA that we saw in the previous post.  You will probably want to extract at least the word mark DISPERSA from the tmk:MarkVerbalElementText element and store it in your database.  An XML shredder “understands” the hierarchical arrangement of the tag-encapsulated information and is able to traverse the hierarchy to locate and extract the information of interest. 
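As a rough illustration of that traversal, here is a minimal Python sketch using the standard library’s xml.etree.ElementTree.  Only the element name MarkVerbalElementText comes from CIPO’s file; the sample document, its wrapper elements and the namespace URI urn:example:tmk are placeholders I made up for the example.  The sketch matches on the local element name, so it does not depend on knowing the real namespace URI.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed stand-in for one trademark file.
# The wrapper elements and the namespace URI are placeholders.
SAMPLE = """\
<tmk:Outer xmlns:tmk="urn:example:tmk">
  <tmk:Inner>
    <tmk:MarkVerbalElementText>DISPERSA</tmk:MarkVerbalElementText>
  </tmk:Inner>
</tmk:Outer>
"""


def local_name(tag):
    """Strip the '{namespace-uri}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def find_word_mark(xml_text):
    """Walk the whole hierarchy and return the first word mark, or None."""
    root = ET.fromstring(xml_text)
    for element in root.iter():  # visits root and all descendants in document order
        if local_name(element.tag) == "MarkVerbalElementText":
            return (element.text or "").strip()
    return None


word_mark = find_word_mark(SAMPLE)
```

Returning None when the element is missing is a deliberate choice in this sketch, so that a file without a word mark does not stop the run.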

There are two basic types of XML shredders: those which use a staging database and those which do not.  In the first case, the shredder extracts information from the XML file and loads it into the staging database.  Additional work is then required to transfer the information from the staging database into the target database.

Shredders that do not use a staging database load the information directly into the target database.  Such shredders are custom-coded to suit a particular application.
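To make the two-step staging route concrete, here is a simplified Python sketch using the standard sqlite3 module.  It is not how any particular commercial product works: the staging table here is one generic table rather than a schema mirroring the XML, and the table names (staging, mark), the file name and the sample document are all assumptions made for illustration.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical sample; only MarkVerbalElementText is a real element name.
SAMPLE = (
    '<t:Mark xmlns:t="urn:example:tmk">'
    "<t:MarkVerbalElementText>DISPERSA</t:MarkVerbalElementText>"
    "<t:Other>not needed</t:Other>"
    "</t:Mark>"
)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staging (source_file TEXT, element TEXT, value TEXT)")
conn.execute("CREATE TABLE mark (source_file TEXT PRIMARY KEY, word_mark TEXT)")


def stage(conn, source_file, xml_text):
    """Step 1: load every element that carries text into the staging table."""
    rows = []
    for el in ET.fromstring(xml_text).iter():
        text = (el.text or "").strip()
        if text:
            rows.append((source_file, el.tag.rsplit("}", 1)[-1], text))
    conn.executemany("INSERT INTO staging VALUES (?, ?, ?)", rows)


def transfer(conn):
    """Step 2: copy only the wanted values from staging into the target table."""
    conn.execute(
        "INSERT INTO mark (source_file, word_mark) "
        "SELECT source_file, value FROM staging "
        "WHERE element = 'MarkVerbalElementText'"
    )


stage(conn, "2084791.xml", SAMPLE)
transfer(conn)
conn.commit()
staged = conn.execute("SELECT COUNT(*) FROM staging").fetchone()[0]
loaded = conn.execute("SELECT source_file, word_mark FROM mark").fetchall()
```

Note that the staging table ends up holding the unwanted Other value as well as the word mark, while the target table receives only the word mark.  A direct shredder would skip the staging table and write to mark straight away.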

The commercially available shredders that I have seen use staging databases.  Typically, they analyze the structure (or “schema”) of the input XML file(s), construct a staging database having a matching schema, extract the tag-encapsulated information from the .xml files and load the information into the staging database.  A major advantage of this approach is that with very little effort by the end user, the XML information is quickly made available in database form.  Sometimes that’s all you need to get the job done, especially in a one-off situation.

Commercial shredders have some disadvantages.  Usually they operate on an “all or nothing” basis, meaning that all of the information from every element in every input XML file is loaded into the staging database.  That can be a problem if you’re working with tens of thousands, or millions, of XML files, especially if you’re interested in only some of the tag-encapsulated information.

If your XML files are complex, for example if they contain hundreds of multi-level elements, then the schema of the staging database will be correspondingly complex—possibly it will be downright weird.  In that case you’ll spend a lot of time scratching your head, wondering how to work with that schema.  You guessed it: CIPO’s trademark .xml files typically have hundreds of multi-level elements.

Probably you’ll only be interested in the information contained in at most a few dozen of those hundreds of multi-level elements.  An “all or nothing” commercial shredder can’t be configured to select information from only the elements you’re interested in and ignore everything else—at least not the ones I’ve seen.  So you’re stuck with a massive staging database full of junk you’re not interested in.  Often you wind up investing a lot of effort deleting the junk so you can focus on the leftover information that actually matters.  And it’s still just a staging database—you still need to move the information into your target database.

Commercial shredders typically expect to encounter XML files consisting of the same invariant set of tag-encapsulated elements in every file, without exception (the information encapsulated within the tags may of course differ from file to file).  However, CIPO’s trademark XML files do not contain the same set of elements in every file.  Some elements exist in some files, but not in others.  There’s nothing odd about that—in some cases certain information is undefined in relation to the mark corresponding to a particular XML file.  For example, if an XML file pertains to a word mark, you won’t find any elements containing Vienna Classification information in that file.  But you will find elements containing Vienna Classification information in an XML file pertaining to a design mark.
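A custom shredder can simply treat such elements as optional.  The Python sketch below shows one way to do that, under some loud assumptions: ViennaCode is a hypothetical element name standing in for whatever elements actually carry Vienna Classification information, and the code values and sample documents are invented.  Only MarkVerbalElementText is taken from CIPO’s file.

```python
import xml.etree.ElementTree as ET

NS = 'xmlns:t="urn:example:tmk"'  # placeholder namespace declaration

# Invented samples: a word mark with no Vienna data, a design mark with some.
WORD_MARK = (
    f"<t:Mark {NS}>"
    "<t:MarkVerbalElementText>DISPERSA</t:MarkVerbalElementText>"
    "</t:Mark>"
)
DESIGN_MARK = (
    f"<t:Mark {NS}>"
    "<t:ViennaCode>01.01.01</t:ViennaCode>"  # hypothetical element and value
    "<t:ViennaCode>26.04.02</t:ViennaCode>"
    "</t:Mark>"
)


def extract(xml_text):
    """Return a record in which every absent element has a neutral default."""
    record = {"word_mark": None, "vienna_codes": []}
    for el in ET.fromstring(xml_text).iter():
        name = el.tag.rsplit("}", 1)[-1]  # drop the namespace prefix
        text = (el.text or "").strip()
        if name == "MarkVerbalElementText" and record["word_mark"] is None:
            record["word_mark"] = text
        elif name == "ViennaCode":
            record["vienna_codes"].append(text)
    return record


word_record = extract(WORD_MARK)
design_record = extract(DESIGN_MARK)
```

Because every field starts out with a default (None or an empty list), both kinds of file produce a record of the same shape, which keeps the database-loading step uniform.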

In fact, none of CIPO’s trademark XML files contain every possible element—or anything close to it.  There are many, many different possible combinations of elements which may be included in or excluded from any particular one of CIPO’s trademark XML files, depending on the specific mark and circumstances (e.g. a unique combination of prosecution, opposition and/or cancellation events) to which that file pertains.  This element variability can wreak havoc on the staging database created by a commercial shredder, because the commercial shredder continually updates—and thereby further complicates—the staging database’s schema, as the shredder encounters files containing elements which have not previously been represented in the schema.
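One way to get a feel for this variability before designing anything is to count how many files contain each element.  The Python sketch below does that for a list of XML strings; the sample documents, the namespace URI and the ViennaCode element name are made up for illustration.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def element_file_counts(xml_texts):
    """Count, for each element name, how many documents contain it."""
    counts = Counter()
    for xml_text in xml_texts:
        root = ET.fromstring(xml_text)
        # A set, so an element repeated within one document is counted once.
        counts.update({el.tag.rsplit("}", 1)[-1] for el in root.iter()})
    return counts


NS = 'xmlns:t="urn:example:tmk"'  # placeholder namespace declaration
DOCS = [
    f"<t:Mark {NS}><t:MarkVerbalElementText>DISPERSA</t:MarkVerbalElementText></t:Mark>",
    f"<t:Mark {NS}><t:ViennaCode>01.01.01</t:ViennaCode>"
    f"<t:ViennaCode>26.04.02</t:ViennaCode></t:Mark>",
]

counts = element_file_counts(DOCS)
# Elements that are missing from at least one document.
optional = sorted(name for name, n in counts.items() if n < len(DOCS))
```

To survey a real folder, you could pass in something like (p.read_text(encoding="utf-8") for p in pathlib.Path(folder).glob("*.xml")), where folder is wherever you unzipped the files.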

Shredders that do not use a staging database are custom-designed to work with specific XML files and with a specific target database of your own design.  In a future post I’ll discuss a possible target database design.  The custom design approach overcomes the disadvantages mentioned above in relation to commercial shredders.  You can feed arbitrarily large batches of files to a custom shredder—if you code it properly.  You can design a custom shredder to traverse complex, non-repeating (i.e. variable), multi-level element structures.  You can pick and choose the elements from which you want to extract information and ignore everything else.  The trade-off is the significant time and effort required to design, build and debug a custom shredder.
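To give a sense of the overall shape, here is a bare-bones Python sketch of a direct shredder.  It is an illustration under stated assumptions, not a finished design: the target table mark, its columns, the WANTED mapping and the batch size are all hypothetical, and a real shredder would handle many more elements and would log failures rather than silently skipping them.

```python
import sqlite3
import xml.etree.ElementTree as ET
from pathlib import Path

# Element local name -> target column. Everything not listed is ignored.
WANTED = {"MarkVerbalElementText": "word_mark"}


def shred_file(path):
    """Extract only the wanted elements from one file."""
    row = {"source_file": Path(path).name, "word_mark": None}
    for el in ET.parse(str(path)).getroot().iter():
        column = WANTED.get(el.tag.rsplit("}", 1)[-1])
        if column and row[column] is None:
            row[column] = (el.text or "").strip()
    return row


def flush(conn, batch):
    """Write one batch to the target table, commit, and empty the batch."""
    conn.executemany(
        "INSERT OR REPLACE INTO mark (source_file, word_mark) "
        "VALUES (:source_file, :word_mark)",
        batch,
    )
    conn.commit()
    count = len(batch)
    batch.clear()
    return count


def shred_folder(folder, conn, batch_size=1000):
    """Shred every .xml file in a folder; return the number of rows written."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS mark "
        "(source_file TEXT PRIMARY KEY, word_mark TEXT)"
    )
    batch, total = [], 0
    for path in sorted(Path(folder).glob("*.xml")):
        try:
            batch.append(shred_file(path))
        except ET.ParseError:
            continue  # skip a malformed file instead of aborting the run
        if len(batch) >= batch_size:
            total += flush(conn, batch)
    total += flush(conn, batch)  # write whatever is left over
    return total
```

Committing in batches keeps memory use flat however many files are in the folder, and the WANTED mapping is the single place where you pick and choose elements.  A typical call would be shred_folder("path/to/unzipped", sqlite3.connect("marks.db")), with both paths being placeholders.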

The intricacies of designing and coding a custom XML shredder are well beyond what can be conveyed in a blog post—or even multiple posts.  However, as coding tasks go, creating a custom XML shredder is not especially difficult, if you have prior coding experience involving XML files and databases or have access to coders with such experience.