09 March 2021

Montreal or Montréal? ETL and data cleansing

If you click the Agent Rankings By City tab at the top of this page, then scroll down, you will find not one but four separate tabulations for: Montreal, Québec; Montreal, Quebec; Montréal, Quebec; and Montréal, Québec.  Why isnt there just one tabulation combining those four city / province combinations as shown below (click to enlarge the image)?

Montréal agent filings

The above tabulation reflects CIPOs data as of 01-Mar-2021 and covers filings made between 06-Mar-2020 and 01-Mar-2021.  Note that this rolling year is offset by about two weeks from the rolling year results shown in the Agent Rankings By City tabulations.  [Ive indicated that the rolling year ends on 01-Mar-2021, rather than 06-Mar-2021, because 01-Mar-2021 was the latest date encompassed by CIPOs data when I queried my database on 06-Mar-2021.]

The answer to the Why isnt there just one tabulation question has to do with the way in which addresses are structured in CIPOs trademark .xml files and also in CIPOs online trademark database (recall that the .xml files and the online database are populated from the same source).  

Details pertaining to the trademark agent—if any—who files a particular application are encapsulated within the tmk:NationalRepresentative element contained within the .xml file that CIPO publishes for the application.  Lets look at that element as set forth in the four applications below.  These applications were all filed on 01-Mar-2021 and are therefore encompassed by the results tabulated above.  Also notice that the four possible Montreal / Montréal; Quebec / Québec combinations are covered.

Application No.

Agent

2088242

ROBIC (Montréal, Québec)

2088212

FASKEN MARTINEAU DUMOULIN LLP (Montreal, Quebec)

2088184

BENOÎT & CÔTÉ INC. (Montréal, Quebec)

2088189

GOWLING WLG (CANADA) LLP (Montreal, Québec)

Here are the tmk:NationalRepresentative elements from CIPOs .xml files for each of those four applications (click to enlarge the image):

tmk:NationalRepresentative elements


Notice that the red-boxed city / province portions of each agents office address within the tmk:NationalRepresentative element appear in the above table, in parentheses, after the agents name.  You will see the same city / province presentations for the respective agents if you click any of the application serial number hyperlinks in the above table to open CIPOs online trademark database page for the respective applications.

So, the composite tabulation above correctly captures Montreal agent filing rankings for the rolling year 06-Mar-2020 to 01-Mar-2021, right?  Wrong!  If you are familiar with the Montreal trademark agent community, you will notice an obvious omission of at least one firm from the list of 15 firms shown above.  Specifically, there is no listing for the Montreal office of the prominent firm Norton Rose Fulbright Canada LLP / S.E.N.C.R.L., s.r.l.  Norton Rose Montreal office filed 402 applications between 06-Mar-2020 and 01-Mar-2021, so that firm should be ranked #4 in terms of filings in the above tabulation.  Why is there no such listing?

The answer again has to do with the way in which the address of Norton Rose Montreal office is structured in the tmk:NationalRepresentative element contained within the .xml files published by CIPO for applications filed by Norton Rose Montreal office.  Here, for example, is that element as set forth in CIPOs .xml file for application no. 2087968 which was filed on 26-Feb-2021 by Norton Rose Montreal office:

tmk:NationalRepresentative element

Again notice the red-boxed city / province details.  In particular, notice that in all five of the tmk:NationalRepresentative elements shown above, the city portion of the agents address is encapsulated within the last one of the sequentially numbered com:AddressLineText elements which are in turn encapsulated within the com:PostalAddressBagcom:PostalAddresscom:PostalStructuredAddress elements.  Unfortunately, although the province, country code and postal code portions of each agents address are separately encapsulated within specific XML elements (i.e. com:GeographicRegionName, com:CountryCode and com:PostalCode respectively) there is no specific element dedicated to encapsulation of the city portion of the agents address.  One must work backwards through the sequentially numbered com:AddressLineText elements to find the city portion of the agents address.  However, you cannot reliably assume that the last one of those elements will always contain the city portion of the agents address and nothing but.

There are 3 sequentially numbered com:AddressLineText elements in the tmk:NationalRepresentative elements shown above for each of the Robic, Fasken, Benoît and Gowling firms; but just 2 such elements in the tmk:NationalRepresentative element shown above for Norton Rose Montreal office.  In the case of the Robic, Fasken, Benoît and Gowling firms, the city portion of each agents address is encapsulated—by itself—within the last (i.e. third) one of the sequentially numbered com:AddressLineText elements.  However, in the case of Norton Rose’ Montreal office, the last (i.e. second) one of the sequentially numbered com:AddressLineText elements encapsulates not only “Montreal” but also “Suite 2500, ” forcing us to expend some effort if we want to extract just the city portion of the address.  This may reflect the way in which the address of Norton Rose’ Montreal office was structured upon initial creation of that office’s agent entry for purposes of CIPO’s trademark application eFiling template.  Whatever the reason, this brings us to the important topic of ETL and data cleansing.

Transferring information from an .xml file into a database often involves a so-called ETL (extract, transform, load) procedure whereby the information of interest is extracted from the .xml file, optionally transformed in one or more ways and then loaded into the database.  Recall that .xml files contain nothing but plain text.  For example, you will find many plain text dates encapsulated within every one of CIPO’s trademark .xml files: filing dates, advertisement dates, allowance dates, registration dates, etc.  Instead of storing dates as plain text, it is expedient to transform the plain text date information into a date data type recognizable by your database system.  That facilitates date computations, e.g. How long did it take, on average, for applications filed by my firm in 2017 to progress from filing to registration?  If dates are stored in your database as plain text, then whenever you ask a question of that sort, every plain text date encompassed by your question must be converted from plain text into a date data type to facilitate the computation, significantly reducing the performance of your database system.

Transformation of plain text information into date, numerical or other data types is a very common aspect of most ETL procedures.  Other transformations, generally referred to as data cleansing can also be used to correct or prevent data inaccuracies.  For example, processing Suite 2500, Montreal so as to recognize the Montreal portion of the encapsulated information as the city portion of the address and store only that portion in the database is a type of data cleansing transformation.

So why didnt I do that?  As of 01-Mar-2021, only 1,367 trademark agents are represented in all of CIPOs 1,767,773 trademark .xml files.  Since one and only one address need be stored for each agent, it would not take much effort to detect and correct anomalies like the one discussed above in relation to Norton Rose’ Montreal office.

One problem is that the issue discussed above is not confined to Norton Rose Montreal office; there are similar problems with another dozen or so agent addresses.  A second problem, as alluded to above, is deciding which one of two or more alternatives is correct.  For example, should all occurrences of Montreal" be transformed to Montréal or vice versa?  What about Quebec vs. Québec; or Saint-laurent vs. St-laurent vs. Ville St-laurent?  A third problem involves typographical errors such as Kelowa or Montréakl; or Vacouver”.  All of these problematic examples—and more—will be found upon inspection of CIPO’s address data for the aforementioned 1,367 trademark agents (as of 01-Mar-2021).

Detecting and correcting any one of the foregoing anomalies by suitably transforming the data before loading it into a database is fairly straightforward.  Detecting and correcting all of them on an ongoing basis is another matter—bearing in mind that new anomalies can and do arise.  The necessary coding is straightforward in each case, but can become quite tedious.  Moreover, this discussion has focused only on CIPOs trademark agent address data, which I consider to be quite clean”, relatively speaking.

If the discussion is expanded to include applicant (or opponent, or Section 45 requesting party, etc.) address data many more anomalies will be found.  CIPO has made a special effort to structure Canadian addresses represented in its trademark .xml files.  No such effort has been made in relation to foreign addresses.  Consequently, database queries which take applicant geographical location into account are best confined to Canada.  Even then you need to be careful if you query at the Canadian province level to ensure that you capture both English and French spellings, e.g. British Columbia vs. Colombie Britannique”, etc.  If you want to query applicant geographical location at the Canadian city level you will have to contend with variations of the same city name, misspellings of city names, etc. for many cities.  Database queries which take foreign applicant geographical location into account should be confined to the country level; attempts to target such queries at the state or city level will be frustrated unless a major ETL / data cleansing effort is undertaken and applied to foreign address data each and every time one of CIPOs weekly trademark .xml datasets is processed into your database.

In establishing an ETL / data cleansing policy for address data, one must decide How far am I prepared to go in detecting & correcting anomalies?  In dealing with addresses contained within CIPOs trademark .xml files I have decided to do relatively little detection and correction of anomalies and focus my efforts elsewhere.  I have thus avoided a lot of tedious ETL / data cleansing work, but at the expense of having to live with anomalies stored in my database.  If I fail to take them into account then, as shown above, I may get incorrect results, e.g. if I query my database in relation to “Montreal” trademark agents without taking “Montréal”, “Suite 2500, Montreal”, etc. into account.

So, in summary, the Montreal agent filing rankings for 06-Mar-2020 through 01-Mar-2021 should appear as follows (click to enlarge the image):

Montréal agent filings

Theres one more thing.  If you look closely at the five tmk:NationalRepresentative elements shown above, youll see that each one of them contains an element similar to this one from CIPOs .xml file for application no. 2087968: <com:CommentText>5153</com:CommentText>.  The 5153 information encapsulated by that element is not exposed by CIPOs online trademark database—either as a searchable field or otherwise—but it is very useful as I will explain in a future post.