Do a reasonable diff on a TeamSite XML file
I recently needed to find out whether there were any differences between a merged TeamSite XML file and the version I originally had. Lots of tickets had gone into the branch this file was in, and someone else did the merge, so when some content wasn't showing up in Grid, the TeamSite file was the suspect.
Anyway, I looked and didn't find a reasonable tool that did what I wanted, but I did find some interesting modules that could be helpful.
- the dictify function from Erik Aronesty's answer to the Stack Overflow question "How to convert an xml string to a dictionary?", which also led me to a couple of other cool tools
- xmltodict
- Dictdiffer
As noted in the code below, I didn't end up using xmltodict, but it is an interesting module that I'll probably want to look at again if I ever need to work with XML much.
Code or Examples
Here is the code that I came up with:
#!/usr/bin/env python
#### -------------------
#### from: https://stackoverflow.com/questions/2148119/how-to-convert-an-xml-string-to-a-dictionary
#### and other links from there
#
#### xmltodict - https://github.com/martinblech/xmltodict (not using, but interesting)
#### dictdiffer - https://github.com/inveniosoftware/dictdiffer
#### -------------------
import dictdiffer
import difflib
import xml.etree.ElementTree as ET
from copy import copy
import textwrap
from rich.console import Console
#### -------------------
console = Console()
#### -------------------
#### dictify is from Erik Aronesty from the stackoverflow url above
def dictify(r,root=True):
    if root:
        return {r.tag : dictify(r, False)}
    d=copy(r.attrib)
    if r.text:
        d["_text"]=r.text
    for x in r.findall("./*"):
        if x.tag not in d:
            d[x.tag]=[]
        d[x.tag].append(dictify(x,False))
    return d
#### -------------------
def split_to_width(width, text):
    retLines = ""
    for line in text.split("\n"):
        retLines += "\n".join(textwrap.wrap(line, width)) + "\n"
    retLines += "\n"
    return retLines
#### -------------------
#### -------------------
def format_dict_diff_line(lineTup):
    retFormatLine = ""
    if _isChange(lineTup):
        retFormatLine = format_change_line(lineTup)
    else:
        retFormatLine = format_addRemove_line(lineTup)
    return retFormatLine

def _isChange(lineTup):
    return lineTup[0] == 'change'

def format_change_line(lineTup):
    # lineTup is ('change', key, (origValue, chngValue))
    retLine = ""
    retLine += f"type: [yellow]{lineTup[0]}[/yellow]\n"
    retLine += f"key: [green]{lineTup[1]}[/green]\n"
    ldiff = lineTup[2]
    retLine += "diff: \n\n"
    retLine += f" [yellow]orig[/yellow]: {ldiff[0]}\n"
    if len(ldiff) > 1:
        retLine += f" [green]chng[/green]: {ldiff[1]}\n"
    retLine += "\n\n"
    return retLine

def format_addRemove_line(lineTup):
    # lineTup is ('add' or 'remove', path, [(key, value)])
    # diffType rather than type, so the builtin isn't shadowed
    diffType = "not in original"
    if lineTup[0] == "remove":
        diffType = "not in changed"
    retLine = ""
    retLine += f"type: [yellow]{diffType}[/yellow]\n"
    ldiff = lineTup[2]
    retLine += f"key: [green]{ldiff[0][0]}[/green]\n"
    retLine += f"value: {ldiff[0][1]}\n"
    retLine += "\n\n"
    return retLine
#### -------------------
#### -------------------
def getDictFromXmlFile(xmlFile):
    # use a separate name for the file handle so the path argument isn't shadowed
    with open(xmlFile) as fh:
        lines = fh.read()
    etXmlRoot = ET.fromstring(lines)
    dictFromEt = dictify(etXmlRoot)
    return dictFromEt
#### -------------------
def getGridCmsKeys(xmlDict):
    """
    This will be VERY specific to the Grid TeamSite/CMS xml files.
    They are in the format shown below. NOTE that it's ALWAYS in
    a CDATA node and it ALWAYS has a trailing space before closing.
    Not sure why, but it's always that way.
    ----------------------
    <?xml version="1.0" encoding="UTF-8"?>
    <content-set>
    <content key="CmsKey1"><![CDATA[Text or whatever data - with trailing space ]]></content>
    <content key="CmsKey2"><![CDATA[value here ]]></content>
    </content-set>
    """
    retDict = {}
    # TODO: check to make sure has keys we expect here and throw if not
    # handle it more gracefully, than just an exception
    cmsContent = xmlDict['content-set']['content']
    for item in cmsContent:
        retDict[item['key']] = item['_text']
    return retDict

def getGridDictFromXmlFile(gridXml):
    elemTreeDict = getDictFromXmlFile(gridXml)
    retDict = getGridCmsKeys(elemTreeDict)
    return retDict

def getKeysNotInDict(orig, chng):
    # keys that are in orig but missing from chng
    keysOnlyInOrig = orig.keys() - chng.keys()
    if len(keysOnlyInOrig) == 0:
        # an empty set prints as set(); return {} so the output shows {}
        keysOnlyInOrig = {}
    return keysOnlyInOrig
#### -------------------
def diffCmsXml(origFile, chngFile):
    orgD = getGridDictFromXmlFile(origFile)
    chgD = getGridDictFromXmlFile(chngFile)
    # keys present on one side only
    onlyInOrig = getKeysNotInDict(orgD, chgD)
    onlyInChng = getKeysNotInDict(chgD, orgD)
    # get the actual differences now - from CMS keys
    result = dictdiffer.diff(orgD, chgD)
    console.print("\n================")
    console.print("Here are the differences: \n")
    console.print(f"keys in orig but not in chng: \n\t{onlyInOrig}")
    console.print(f"keys in chng but not in orig: \n\t{onlyInChng}")
    console.print("\n----------------\n")
    for entry in result:
        prnLine = format_dict_diff_line(entry)
        console.print(f"{prnLine}")
#### -------------------
def main():
    # origFile = "./RealTimeQuotesSubscriptionsCommon.xml"
    # chngFile = "./RealTimeQuotesSubscriptionsCommon_merged.xml"
    origFile = "./orig.xml"
    chngFile = "./chng.xml"
    diffCmsXml(origFile, chngFile)
    print("DONE! \n\n")
#### =====================================================================
if __name__ == "__main__":
    main()
Here is the orig.xml file:
<?xml version="1.0" encoding="UTF-8"?>
<content-set>
<content key="headerTitle"><![CDATA[Get Real-Time Quotes ]]></content>
<content key="headerBoilerplate"><![CDATA[<p>
Real-time quotes are vital to trading and investing. That's why the securities
exchanges require TD Ameritrade to verify how you intend to use your account.
Answer a few questions, then read and/or sign the agreements.
</p>
<ul>
<li><b>If your professional status is confirmed,</b> you can sign up and subscribe to real-time quotes.</li>
<li><b>If your nonprofessional status is confirmed,</b> you'll receive real-time quotes at no cost to you.</li>
<li><b>If you choose to not sign the agreements,</b> market data in your account will be delayed by 15 minutes.</li>
</ul> ]]></content>
<content key="proSubscriptionCRDInfo"><![CDATA[Central Registration Depository(CRD) number (optional) ]]></content>
<content key="proSubscriptionAgreementBoilerplate"><![CDATA[To access real-time quotes, read and electronically sign the following exchange agreements provided by the <b>New York Stock Exchange</b> (NYSE) and the <b>Options Price Reporting Authority</b> (OPRA). Then sign up for real-time quotes via ]]></content>
<content key="cancelBtnCopy"><![CDATA[Go back and save ]]></content>
</content-set>
And here is the chng.xml file:
<?xml version="1.0" encoding="UTF-8"?><content-set>
<content key="proSubscriptionCRDInfo"><![CDATA[Central Registration Depository (CRD) number (optional) ]]></content>
<content key="headerTitle"><![CDATA[Get Real-Time Quotes ]]></content>
<content key="headerBoilerplate"><![CDATA[<p>
Real-time quotes are vital to trading and investing. That's why the securities
exchanges require TD Ameritrade to verify how you intend to use your account.
Answer a few questions, then read and/or sign the agreements.
</p>
<ul>
<li><b>If your professional status is confirmed,</b> you can sign up and subscribe to real-time quotes.</li>
<li><b>If your nonprofessional status is confirmed,</b> you'll receive real-time quotes at no cost to you.</li>
<li><b>If you choose to not sign the agreements,</b> market data in your account will be delayed by 15 minutes.</li>
</ul> ]]></content>
<content key="proSubscriptionAgreementBoilerplate"><![CDATA[To access real-time quotes, first read and electronically sign the following exchange agreements provided by the <b>New York Stock Exchange</b> (NYSE) and the <b>Options Price Reporting Authority</b> (OPRA). Then sign up for real-time quotes via ]]></content>
<content key="tryAgainCopy"><![CDATA[Please try again to complete your update. ]]></content>
</content-set>
Running the script generates the following output:
./test.py
================
Here are the differences:
keys in orig but not in chng:
{'cancelBtnCopy'}
keys in chng but not in orig:
{'tryAgainCopy'}
----------------
type: change
key: headerBoilerplate
diff:
orig: <p>
Real-time quotes are vital to trading and investing. That's why the securities
exchanges require TD Ameritrade to verify how you intend to use your account.
Answer a few questions, then read and/or sign the agreements.
</p>
<ul>
<li><b>If your professional status is confirmed,</b> you can sign up and subscribe to real-time quotes.</li>
<li><b>If your nonprofessional status is confirmed,</b> you'll receive real-time quotes at no cost to you.</li>
<li><b>If you choose to not sign the agreements,</b> market data in your account will be delayed by 15 minutes.</li>
</ul>
chng: <p>
Real-time quotes are vital to trading and investing. That's why the securities
exchanges require TD Ameritrade to verify how you intend to use your account.
Answer a few questions, then read and/or sign the agreements.
</p>
<ul>
<li><b>If your professional status is confirmed,</b> you can sign up and subscribe to real-time quotes.</li>
<li><b>If your nonprofessional status is confirmed,</b> you'll receive real-time quotes at no cost to you.</li>
<li><b>If you choose to not sign the agreements,</b> market data in your account will be delayed by 15 minutes.</li>
</ul>
type: change
key: proSubscriptionCRDInfo
diff:
orig: Central Registration Depository(CRD) number (optional)
chng: Central Registration Depository (CRD) number (optional)
type: change
key: proSubscriptionAgreementBoilerplate
diff:
orig: To access real-time quotes, read and electronically sign the following exchange agreements provided by the <b>New
York Stock Exchange</b> (NYSE) and the <b>Options Price Reporting Authority</b> (OPRA). Then sign up for real-time quotes via
chng: To access real-time quotes, first read and electronically sign the following exchange agreements provided by the
<b>New York Stock Exchange</b> (NYSE) and the <b>Options Price Reporting Authority</b> (OPRA). Then sign up for real-time
quotes via
type: not in original
key: tryAgainCopy
value: Please try again to complete your update.
type: not in changed
key: cancelBtnCopy
value: Go back and save
DONE!
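One thing the output doesn't make obvious is where two long values actually differ. The headerBoilerplate entry, for example, looks identical on both sides, so the difference there is presumably whitespace (that's my guess, not something the output confirms). A minimal sketch using difflib from the standard library could pinpoint the differing characters. The helper name show_char_diffs is hypothetical and is not part of the script above.

```python
import difflib


def show_char_diffs(orig, chng):
    """Return (operation, orig_piece, chng_piece) for every region that differs."""
    matcher = difflib.SequenceMatcher(None, orig, chng, autojunk=False)
    return [
        (op, orig[i1:i2], chng[j1:j2])
        for op, i1, i2, j1, j2 in matcher.get_opcodes()
        if op != "equal"
    ]


# Values taken from the proSubscriptionCRDInfo entries in orig.xml and chng.xml
orig = "Central Registration Depository(CRD) number (optional) "
chng = "Central Registration Depository (CRD) number (optional) "

for op, orig_piece, chng_piece in show_char_diffs(orig, chng):
    # repr() makes whitespace-only differences visible
    print(op, repr(orig_piece), repr(chng_piece))
```

Printing the pieces with repr() is the point here: a stray space or newline shows up as ' ' or '\n' instead of disappearing into the terminal output.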
Pretty useful. I'll probably want to turn it into a cmd2 script and add options to print only the changes, additions, or removals. With a little work, it could probably be made into a more generic XML diff tool.
Another thought: it could take a path (something like a.b.c, which would translate, given the code above, to xmlDict['a']['b']['c']) to make it easier to compare like-to-like nodes. Then this could work on a more generic basis, though you'd still need to supply the paths.
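As a rough idea of what the "only print some kinds of differences" option might look like, here is a small sketch that filters diff entries by their type. It works on plain tuples in the same shape the formatting functions in the script read (type first, then key, then the values), so it doesn't need dictdiffer to run. The function name filter_diff and the shortened sample values are hypothetical.

```python
def filter_diff(entries, wanted):
    """Keep only the diff entries whose type (first tuple item) is in `wanted`."""
    wanted = set(wanted)
    return [entry for entry in entries if entry[0] in wanted]


# Sample entries in the tuple shape the script's formatting functions expect
sample = [
    ("change", "proSubscriptionCRDInfo", ("Depository(CRD)", "Depository (CRD)")),
    ("add", "", [("tryAgainCopy", "Please try again to complete your update. ")]),
    ("remove", "", [("cancelBtnCopy", "Go back and save ")]),
]

only_changes = filter_diff(sample, {"change"})
adds_and_removes = filter_diff(sample, {"add", "remove"})

print([entry[0] for entry in only_changes])
print([entry[0] for entry in adds_and_removes])
```

In the script, the filtered list could be passed through format_dict_diff_line just as the full result is now.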
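The path idea could be sketched roughly like this. The helper name get_by_path is hypothetical. One assumption worth calling out: dictify stores child elements in lists, so this sketch also accepts numeric path segments as list indexes, which goes a little beyond the plain a.b.c form.

```python
def get_by_path(data, path, sep="."):
    """Follow a dotted path such as 'a.b.c' through nested dicts and lists."""
    node = data
    for part in path.split(sep):
        # Numeric segments index into lists; everything else is a dict key
        node = node[int(part)] if isinstance(node, list) else node[part]
    return node


# Plain nested dicts: the same as xml_dict['a']['b']['c']
xml_dict = {"a": {"b": {"c": "value at a.b.c"}}}
print(get_by_path(xml_dict, "a.b.c"))

# A dictify-style structure, where children are stored in lists
cms_dict = {"content-set": {"content": [{"key": "CmsKey1", "_text": "value here "}]}}
print(get_by_path(cms_dict, "content-set.content.0._text"))
```

A path that doesn't exist raises KeyError or IndexError here, which is probably fine for a quick comparison script but would need friendlier handling in a real tool.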
Anyway, interesting stuff.