parsing - How to parse a huge XML file (on the go) using Python
I have a huge XML file (the current Wikipedia dump). The file is 45 GB in size and represents the entire data of the current Wikipedia. Here are the first few lines of the file (output of more):
<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.8/" xmlns:xsi="http://www.w3.org/2001/xmlschema-instance" xsi:schemalocation="http://www.mediawiki.org/xml/export-0.8/ http://www.mediawiki.org/xml/export-0.8.xsd" version="0.8" xml:lang="en">
  <siteinfo>
    <sitename>wikipedia</sitename>
    <base>http://en.wikipedia.org/wiki/main_page</base>
    <generator>mediawiki 1.21wmf6</generator>
    <case>first-letter</case>
    <namespaces>
      <namespace key="-2" case="first-letter">media</namespace>
      <namespace key="-1" case="first-letter">special</namespace>
      <namespace key="0" case="first-letter" />
      <namespace key="1" case="first-letter">talk</namespace>
      <namespace key="2" case="first-letter">user</namespace>
      <namespace key="3" case="first-letter">user talk</namespace>
      <namespace key="4" case="first-letter">wikipedia</namespace>
      <namespace key="5" case="first-letter">wikipedia talk</namespace>
      <namespace key="6" case="first-letter">file</namespace>
      <namespace key="7" case="first-letter">file talk</namespace>
      <namespace key="8" case="first-letter">mediawiki</namespace>
      <namespace key="9" case="first-letter">mediawiki talk</namespace>
      <namespace key="10" case="first-letter">template</namespace>
      <namespace key="11" case="first-letter">template talk</namespace>
      <namespace key="12" case="first-letter">help</namespace>
      <namespace key="13" case="first-letter">help talk</namespace>
      <namespace key="14" case="first-letter">category</namespace>
      <namespace key="15" case="first-letter">category talk</namespace>
      <namespace key="100" case="first-letter">portal</namespace>
      <namespace key="101" case="first-letter">portal talk</namespace>
      <namespace key="108" case="first-letter">book</namespace>
      <namespace key="109" case="first-letter">book talk</namespace>
      <namespace key="446" case="first-letter">education program</namespace>
      <namespace key="447" case="first-letter">education program talk</namespace>
      <namespace key="710" case="first-letter">timedtext</namespace>
      <namespace key="711" case="first-letter">timedtext talk</namespace>
    </namespaces>
  </siteinfo>
  <page>
    <title>accessiblecomputing</title>
    <ns>0</ns>
    <id>10</id>
    <redirect title="computer accessibility" />
    <revision>
      <id>381202555</id>
      <parentid>381200179</parentid>
      <timestamp>2010-08-26t22:38:36z</timestamp>
      <contributor>
        <username>olenglish</username>
        <id>7181920</id>
      </contributor>
      <minor />
      <comment>[[help:reverting|reverted]] edits [[special:contributions/76.28.186.133|76.28.186.133]] ([[user talk:76.28.186.133|talk]]) last version gurch</comment>
      <text xml:space="preserve">#redirect [[computer accessibility]] {{r camelcase}}</text>
      <sha1>lo15ponaybcg2sf49sstw9gdjmdetnk</sha1>
      <model>wikitext</model>
... and so on.
Notice the page element in the tree. Each one corresponds to a unique page in Wikipedia, and the given XML consists of all the pages of Wikipedia in the form of page elements. I need to write a parser that extracts the value of the title entry for each page and, suppose (for simplicity), prints it.
I am trying to build this using Python (although I am open to switching languages if another one offers a solution). The only way I know of is to use ElementTree.
However, using the function parse('file.xml') requires the entire document to be parsed first, and only then are the results output. As is evident, I know the entire XML will consist of page elements. I want the program to begin printing the titles while it is still parsing the rest of the XML. Is this possible? If so, how?
Edit note: I cite the example of extracting titles here only to keep things simple in the question. However, I do need full XML parsing features, since I will need to extract other data in the same way in the future.
What you want is an event-based XML library, which sends you the pieces as it parses them incrementally, rather than creating a tree of the whole document. The typical answer is the xml.sax stdlib module, though I'm sure there are many others.
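As a rough illustration of the event-based approach, here is a minimal sketch using xml.sax. The names TitleHandler and stream_titles, and the on_title callback, are made up for this example and are not part of any library. It assumes the dump uses a default namespace with no prefix, as in the excerpt in the question, so the element name arrives as plain "title".

```python
import xml.sax


class TitleHandler(xml.sax.ContentHandler):
    """Hypothetical handler: reports each <title> as soon as its end tag is seen."""

    def __init__(self, on_title=print):
        super().__init__()
        self.on_title = on_title  # callback invoked once per title
        self._in_title = False
        self._chunks = []

    def startElement(self, name, attrs):
        # Start buffering text when a <title> element opens.
        if name == "title":
            self._in_title = True
            self._chunks = []

    def characters(self, content):
        # The parser may deliver the text of one element in several pieces.
        if self._in_title:
            self._chunks.append(content)

    def endElement(self, name):
        # Emit the title immediately; nothing else about the page is kept.
        if name == "title":
            self._in_title = False
            self.on_title("".join(self._chunks))


def stream_titles(source, on_title=print):
    """Parse `source` (a filename or a binary file object) incrementally."""
    xml.sax.parse(source, TitleHandler(on_title))
```

Calling stream_titles('file.xml') should start printing titles while the rest of the file is still being read, and memory use stays roughly constant because no tree is built. Extracting other fields later would mean tracking more element names in the same handler.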