=begin
== NAME

loosesox.rb -- Loose and Simple Objects for XML

Version 1.1.4 Sep. 8 2005

== COPYRIGHT

Copyright (c) 2003 MIZUTANI Tociyuki All Rights Reserved.

This is free software. You can distribute/modify it under the terms of
the GNU Lesser General Public License version 2 or later.

== SYNOPSIS

* Slices an XML manuscript into line-oriented pieces, each a String.
* The parser may yield nodes to a given block, like SAX from apache.org.
* The parser may create a document tree, like the W3C DOM.
* Any node in the document tree accepts an XML::Visitor instance.
* XML::Document and XML::Element are subclasses of the built-in Array.
* XML::Element is [XML::Tag, 1st node, 2nd node, .., XML::Endtag]
* Recognizes empty HTML elements by tag name automatically: img, hr, etc.
* Does not recognize a missing endtag in HTML.
* Every other leaf node is a subclass of the built-in String.
* So nodes respond to all messages of the built-in String or Array.
* Little compatibility with the W3C DOM: nodeName, nodeType, and class names.
* Does not split and does not parse the internal DTD in the DocumentType node.
* CANNOT recognize the individual declarations inside a DTD.
* CANNOT handle conditional inclusions; an error is raised for them.

== CHANGES

:Version 1.1.4 Sep. 8 2005
  add raise NotSupport for conditional inclusions.
:Version 1.1.3 Sep. 3 2005
  fix DOCTYPE regex.
:Version 1.1.2 Sep. 2 2004
  change as a non-empty element in HTML parsing
:Version 1.1.1 Oct. 30 2003
  remove XML::Node#to_str() method for Ruby 1.8.0
:Version 1.1.0 Oct. 28 2003
  change the module name of XML::Nodability to XML::Node
  change the method name of XML::Node#node? to XML::Node#is_node?
  change XML::Node#accept and walkin to visit a child only if child.is_a?(XML::Node)
  change XML::Node#nodeName to pick the name from its string every time
  make XML::Node#to_s return an instance of String
  add a function XML.node?(object[, class[, name]])
  add a class XML::DocumentFragment
  add a class method XML::Element[tag, child nodes ..., endtag]
  add a class method XML::Document[child nodes ...]
  add a class method XML::DocumentFragment[child nodes ...]
  make XML::Node#is_node? raise TypeError unless class<=XML::Node
  make XML.node? raise TypeError unless class<=XML::Node
:Version 1.0.1 Oct. 26 2003
  add XML::Element#emptyElement?()
  fix XML::Nodability#endtag?(), which did not work correctly
:Version 1.0.0 Oct. 23 2003
  First publishing.

== USAGE

=== catxml

Read XML manuscripts from stdin and echo them out.

  require 'loosesox'
  document = XML.parse(ARGF.read)
  print document

If you have an HTML manuscript that does not use the XML empty-element
notation, then use either of

  document = XML.parsehtml(ARGF.read)
  document = XML.parse(ARGF.read(),XML::HTMLmode)

The two lines above work the same way.

Another version, using a block.

  require 'loosesox'
  XML.parse(ARGF.read) { |node| print node if node.document? }

Using a visitor.

  require 'loosesox'
  class CatXML < XML::Visitor
    def visitDocument(node)
      print node
    end
  end
  visitor = CatXML.new
  document = XML.parse(ARGF.read)
  document.accept(visitor)

=== catbody

As another example, we print the body parts of the given HTML documents.
Notice that, because of a limitation of the parser, every element must be
closed by its endtag, as in an XHTML document. An empty element, however,
may be written in the plain HTML form instead of the XML empty-element
notation.

  require 'loosesox'
  document = XML.parsehtml(ARGF.read)
  html = document.find { |node| node.element?('html') }
  body = html.find { |node| node.element?('body') }
  print body

The example above prints the first body part in the first html element node.
To catch all body parts, like the 'cat' command, use a block.

  require 'loosesox'
  XML.parsehtml(ARGF.read) { |node| print node if node.element?('body') }

Or, using a visitor,

  require 'loosesox'
  class Catbody < XML::Visitor
    def visitElement(node)
      print node if node.element?('body')
    end
  end
  visitor = Catbody.new
  document = XML.parsehtml(ARGF.read)
  document.accept(visitor)

=== bodytext

Get all of the text parts in the body elements of HTML documents.
The first example applies a visitor to the body element node.

  require 'loosesox'
  class TextPrinter < XML::Visitor
    def visitText(node)
      print node
    end
  end
  visitor = TextPrinter.new
  document = XML.parsehtml(ARGF.read)
  html = document.find { |node| node.element?('html') }
  body = html.find { |node| node.element?('body') }
  body.accept(visitor)

A combination of a visitor and the State design pattern pays off in more
complex situations. Since the State objects may be Singletons, I write them
as constants, each with a singleton method named 'reduce' that does the work.
Transitions between states occur when visiting the body element node and
when leaving it.

  require 'loosesox'
  class TextPrinter < XML::Visitor
    Inbody = Object.new
    Outbody = Object.new
    def Inbody.reduce(node)
      print node
    end
    def Outbody.reduce(node) end # do nothing
    def visitDocumentBegin(node)
      @state = Outbody # set initial state
    end
    def visitElementBegin(node)
      @state = Inbody if node.element?('body')
    end
    def visitElement(node)
      @state = Outbody if node.element?('body')
    end
    def visitText(node)
      @state.reduce(node)
    end
  end
  visitor = TextPrinter.new
  document = XML.parse(ARGF.read)
  document.accept(visitor)

An example using a block. A flag variable named 'inbody' shows whether we
are inside the body element.
It is set to true when we catch a 'body' tag at the entrance of the body
element. The parser yields the tag node, then the child nodes of the
element, then the 'body' element node itself, in that order. So the flag
'inbody' is set to false when we catch the 'body' element node at the end of
the element. We print every text node seen while 'inbody' is true.

  require 'loosesox'
  inbody = false
  XML.parsehtml(ARGF.read) do |node|
    inbody = true if node.tag?('body')
    inbody = false if node.element?('body')
    print node if node.text? and inbody
  end

The same as above, but using the State design pattern. A global variable
'$action' holds the current state object, which has a method 'reduce'.
Each call checks whether to act or to move to another state.

  require 'loosesox'
  $action = nil
  Outbody = Object.new
  Inbody = Object.new
  def Outbody.reduce(node)
    $action = Inbody if node.tag?('body')
  end
  def Inbody.reduce(node)
    if node.element?('body')
      $action = Outbody
    elsif node.text?
      print node
    end
  end
  $action = Outbody
  XML.parse(ARGF.read) { |node| $action.reduce(node) }

=end

module XML

=begin
== XML Document Objects

* XML::Node -- the mix-in for all of the Loose and Simple Objects for XML
* XML::Document, XML::Element, XML::DocumentFragment < Array + XML::Node
* XML::Text, XML::Comment < String + XML::Node
* XML::DocumentType, XML::CDATASection < String + XML::Node
* XML::ProcessingInstruction < String + XML::Node
* XML::Tag, XML::Endtag < String + XML::Node

=== Creators

--- XML::Document.new()
    Create an empty document node instance.
--- XML::Document.[]([...])
    Create a document node instance from the list of args.
--- XML::DocumentFragment.new()
    Create an empty document fragment node instance.
--- XML::DocumentFragment.[]([...])
    Create a document fragment node instance from the list of args.
--- XML::ProcessingInstruction.new(s)
    Create a processing instruction node instance from the string.
Example: node = XML::ProcessingInstruction.new( %|\n| ) --- XML::DocumentType.new(s) Create a document type node instance with the string. Example: node = XML::DocumentType.new( "" ) --- XML::Element.new(tag) Create an element node instance with the string such as ''. The method creates a tag node automatically. Example: node = XML::Element.new("

") node.push XML::Text("Now on test.") node.endtag("

\n") --- XML::Element.[](tag[, childs ...][, endtag]) Create an element node instance filling with list of args. The method creates a tag node automatically. For the un-empty element, endtag must be specified as string. Example: node = XML::Element["

",XML::Text("Now on test."),"

\n"] node = XML::Element[%|Foo|] --- XML::Tag.new(s) Create a tag node instance with the string. Example: node = XML::Tag.new(%|
\n|) node = XML::Tag.new(%|Foo|) --- XML::Endtag.new(s) Create a endtag node instance with the string. Example: node = XML::Endtag.new("
\n\n")
--- XML::Text.new(s)
    Create a text node instance from the string.
    Example:
      node = XML::Text.new("if i &lt; 10")
--- XML::CDATASection.new(s)
    Create a CDATA section node instance from the string.
    Example:
      node = XML::CDATASection.new < ]]> EOS
--- XML::Comment.new(s)
    Create a comment node instance from the string.
    Example:
      node = XML::Comment.new("\n")

=== Common methods and functions

--- XML::Node#accept(visitor)
    Walk the visitor around the tree, starting from the receiver. On
    visiting a node, the visit method of the visitor is chosen according
    to the class of the node.
--- XML::Node#walkin(visitor)
    Walk the visitor around the tree, starting from the receiver. On
    visiting a node, every node calls the same method, visit, of the visitor.
--- XML::Node#nodeName
    DOM node name string of the receiver.
    :Name
      * '#document' for XML::Document
      * processingname for XML::ProcessingInstruction
      * typename for XML::DocumentType
      * tagname for XML::Element
      * tagname for XML::Tag
      * tagname for XML::Endtag
      * '#text' for XML::Text
      * '#comment' for XML::Comment
      * '#cdata-section' for XML::CDATASection
--- XML::Node#nodeType
    DOM node type constant.
    :Constant
      * XML::ELEMENT_NODE
      * XML::TEXT_NODE
      * XML::CDATA_SECTION_NODE
      * XML::PROCESSING_INSTRUCTION_NODE
      * XML::COMMENT_NODE
      * XML::DOCUMENT_NODE
      * XML::DOCUMENT_TYPE_NODE
      * XML::DOCUMENT_FRAGMENT_NODE
      * XML::TAG_NODE
      * XML::ENDTAG_NODE
--- XML.node?(instance)
    Is the instance an XML::Node?
--- XML.node?(instance, class[, name])
    Is the instance of the class, and is its node name the same as the name?
    The class must be XML::Node or a class that includes it; otherwise a
    TypeError is raised.
--- XML::Node#is_node?(class[, name])
    Is the receiver of the class? And is its node name the same as the name?
    For an instance of an arbitrary class, use the function XML.node?(),
    which is safe to call on any object.
    The class must be XML::Node or a class that includes it; otherwise a
    TypeError is raised.
--- XML::Node#document?
    Is the receiver a document node?
--- XML::Node#fragment?
    Is the receiver a document fragment node?
--- XML::Node#instruction?([name])
    Is the receiver a processing instruction node?
    And is its node name the same as the name?
--- XML::Node#doctype?([typename])
    Is the receiver a document type node?
    And is its node name the same as the typename?
--- XML::Node#element?([tagname])
    Is the receiver an element node?
    And is its node name the same as the tagname?
    Give the tagname as 'title', or as 'svg:rect' with its namespace prefix.
--- XML::Node#tag?([tagname])
    Is the receiver a tag node?
    And is its node name the same as the tagname?
    Give the tagname as 'title', or as 'svg:rect' with its namespace prefix.
--- XML::Node#endtag?([tagname])
    Is the receiver an endtag node?
    And is its node name the same as the tagname?
    Give the tagname as 'title' or 'svg:rect', without the leading slash.
--- XML::Node#text?
    Is the receiver a text node?
--- XML::Node#comment?
    Is the receiver a comment node?
--- XML::Node#cdata?
    Is the receiver a CDATA section node?

All Ruby built-in methods of String and Array are available.
DOM nodeValue is not implemented; use the instance itself instead.

=== XML::Element additional methods

--- XML::Element#endtag(s)
    Push the endtag string.
--- XML::Element#emptyElement?(mode = XML::HTMLmode)
    Is it an empty element? Checks for a tag closed with the XML
    empty-element notation. The argument, which may be omitted, selects
    the checking behavior.
    * XML::HTMLmode (default) checks both the tag name and the markup notation.
    * XML::XMLmode checks only the markup notation.
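
The emptyElement? rule can be tried out on its own. What follows is a
minimal stand-alone sketch, not the library's code: HTML_EMPTY_TAGS,
tag_name and empty_element? are hypothetical names, the tag list is copied
from the table in XML::Element, and the XML notation is assumed to mean
a tag string that contains '/>'.

```ruby
# Hypothetical stand-alone helper that mirrors the documented rule.
HTML_EMPTY_TAGS = %w[
  area base basefont br bgsound col frame hr img input
  isindex link meta param spacer
].freeze

# Lower-cased tag name taken from a start-tag string.
def tag_name(tag)
  tag[/<([:\w]+)/, 1].to_s.downcase
end

# html_mode = true : XML notation or a known HTML empty tag name.
# html_mode = false: XML notation only.
def empty_element?(tag, html_mode = true)
  return true if tag.include?('/>')
  html_mode && HTML_EMPTY_TAGS.include?(tag_name(tag))
end
```

In this sketch a tag written with the XML notation counts as empty in both
modes, while a bare HTML tag such as img counts as empty only in HTML mode.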
=end

  # Constant for nodeType according to W3C's DOM
  ELEMENT_NODE = 1
  TEXT_NODE = 3
  CDATA_SECTION_NODE = 4
  PROCESSING_INSTRUCTION_NODE = 7
  COMMENT_NODE = 8
  DOCUMENT_NODE = 9
  DOCUMENT_TYPE_NODE = 10
  DOCUMENT_FRAGMENT_NODE = 11
  TAG_NODE = -1
  ENDTAG_NODE = -2

  HTMLmode = true
  XMLmode = false

  def XML.node?(obj, c=Node, name=nil)
    unless Node>=c
      raise TypeError, "XML.node?() only XML::Node>=Class"
    end
    if name
      obj.is_a?(c) and obj.nodeName == name
    else
      obj.is_a?(c)
    end
  end

  module Node
    def is_node?(c, name=nil)
      unless Node>=c
        raise TypeError, "XML::Node#node?() only XML::Node>=Class"
      end
      if name
        is_a?(c) and name==nodeName
      else
        is_a?(c)
      end
    end
    def document?() is_a?(Document) end
    def fragment?() is_a?(DocumentFragment) end
    def instruction?(name=nil)
      if name
        is_a?(ProcessingInstruction) and name == nodeName
      else
        is_a?(ProcessingInstruction)
      end
    end
    def doctype?(name=nil)
      if name
        is_a?(DocumentType) and name == nodeName
      else
        is_a?(DocumentType)
      end
    end
    def element?(name=nil)
      if name
        is_a?(Element) and name == nodeName
      else
        is_a?(Element)
      end
    end
    def tag?(name=nil)
      if name
        is_a?(Tag) and name == nodeName
      else
        is_a?(Tag)
      end
    end
    def endtag?(name=nil)
      if name
        is_a?(Endtag) and name == nodeName
      else
        is_a?(Endtag)
      end
    end
    def text?() is_a?(Text) end
    def comment?() is_a?(Comment) end
    def cdata?() is_a?(CDATASection) end
    def walkin(visitor)
      if document? or element?
        each { |n| n.walkin(visitor) if n.is_a?(Node) }
      end
      visitor.visit(self)
    end
    def to_s() ''+super end
    #def to_str() ''+super end
    #abstract accept(visitor)
    #abstract nodeName()
    #abstract nodeType()
  end

=begin
=== XML::Document
The root document container is an Array. It keeps child nodes.
=end

  class Document < Array
    include Node
    def Document.[](*list)
      its = new
      its.push(*list) unless list.empty?
its end def accept(visitor) visitor.visitDocumentBegin(self) each { |n| n.accept(visitor) if n.is_a?(Node) } visitor.visitDocument(self) end def nodeName() '#document' end def nodeType() DOCUMENT_NODE end end =begin === XML::DocumentFragment The document fragment container is an Array. It keeps child nodes temporally. The first element of this is String.new('') due to to_s() always returns String. =end class DocumentFragment < Array include Node def DocumentFragment.[](*list) its = new its.push(*list) unless list.empty? its end def accept(visitor) visitor.visitDocumentFragmentBegin(self) each { |n| n.accept(visitor) if n.is_a?(Node) } visitor.visitDocumentFragment(self) end def nodeName() '#flagment' end def nodeType() DOCUMENT_FRAGMENT_NODE end end =begin === XML::DocumentType Document type is a String. It may contain internal DTD. "\n" "\n .... \n ]>\n" =end class DocumentType < String include Node def accept(visitor) visitor.visitDocumentType(self) end def nodeName() if /\n" =end class ProcessingInstruction < String include Node def accept(visitor) visitor.visitProcessingInstruction(self) end def nodeName() if /<\?(\w+)/ =~ self $1.downcase else '#instruction' end end def nodeType() PROCESSING_INSTRUCTION_NODE end end =begin === XML::Element Element is an Array with XML::Tag and XML::Endtag. If it is empty Element, it becames [XML::Tag], otherwise [XML::Tag, 1st child node, 2nd one, .., last one, XML::Endtag]. h2:[h2:"

",Text:"Section 1. Example",h2:"

\n"] img:[img:" \"XML\n"] hr:[hr:"
\n"] =end class Element < Array include Node EmptyElement = { "area"=>true, "base"=>true, "basefont"=>true, "br"=>true, "bgsound" => true, "col"=>true, "frame"=>true, "hr"=>true, "img"=>true, "input"=>true, "isindex"=>true, "link"=>true, "meta"=>true, "param"=>true, "spacer"=>true } EmptyElement.default=false def initialize(s) super() s = Tag.new(s) unless s.is_a?(Tag) unshift s end def Element.[](*list) raise "Absent tag" if list.empty? its = new(list[0]) its.push(*list[1...-1]) if list.size >= 3 its.endtag(list[-1]) if list.size >= 2 its end def endtag(s) s = Endtag.new(s) unless s.is_a?(Endtag) push s end def emptyElement?(htmlMode = true) raise "Absent tag" if self.empty? if /\/>/ =~ self[0] true elsif htmlMode EmptyElement[self[0].nodeName] else false end end def accept(visitor) visitor.visitElementBegin(self) each { |n| n.accept(visitor) if n.is_a?(Node) } visitor.visitElement(self) end def nodeName() if not empty? and self[0].is_a?(Node) self[0].nodeName else '#element' end end def nodeType() ELEMENT_NODE end end =begin === XML::Tag Tag is a String. First node of the XML::Element. The nodeName is a tag name with a XML name space prefix. p:"

" hr:"


" dc:title: =end class Tag < String include Node def accept(visitor) visitor.visitTag(self) end def nodeName() if /<([:\w]+)/ =~ self $1.downcase else '#tag' end end def nodeType() TAG_NODE end end =begin === XML::EndTag Endtag is a String. Last node of the XML::Element. The nodeName is a tag name with a XML name space prefix. p:"

\n" dc:title:"
\n" =end class Endtag < String include Node def accept(visitor) visitor.visitEndtag(self) end def nodeName() if /<\/([:\w]+)/ =~ self $1.downcase else '#endtag' end end def nodeType() ENDTAG_NODE end end =begin === XML::Text Text is a String. In it special characters must be escaped. * '&' as '&' * '<' as '<' * '>' as '>' * '"' as '"' * ''' as "'" " Foo "bar" <baz>\n\nLine oriented.\n \n" =end class Text < String include Node def accept(visitor) visitor.visitText(self) end def nodeName() "#text" end def nodeType() TEXT_NODE end end =begin === XML::CDATASection Character data section is a String. It may contain special characters. "bar\n \n]]>\n" =end class CDATASection < String include Node def accept(visitor) visitor.visitCDATASection(self) end def nodeName() "#cdata-section" end def nodeType() CDATA_SECTION_NODE end end =begin === XML::Comment Comment not is in a character data section or a document type definitions. It is a String. "\n" "\n\n" =end class Comment < String include Node def accept(visitor) visitor.visitComment(self) end def nodeName() "#comment" end def nodeType() COMMENT_NODE end end =begin == Base class for Visitor === XML::Visitor A visitor walks around tree structures of node. Reaching the specific node, the node calls the ordering methods in the given visitor with single parameter self. You must make a sub class of this, and overwrites for your purposes. For the default, all methods do nothing conveniently. class MyVisitor < XML::Visitor def visitText(t) print t end end document.accept(MyVisitor.new) In using XML::Node#accept(visitor), a node calls corresponding methods of Visitor by the basis of double dispatching technic. On the other hand, in using XML::Node#walkin(visitor), all nodes call same methods 'visit' of Visitor as like as parser's calling given block. class DebugDumper < XML::Visitor def visit(node) p node end end document.walkin(DebugDumper.new) --- XML::Visitor#visit(node) The node calls from XML::Node#walkin(visitor). 
--- XML::Visitor#visitDocumentBegin(document)
    The document calls this before walking its child nodes.
--- XML::Visitor#visitDocument(document)
    The document calls this after walking its child nodes.
--- XML::Visitor#visitDocumentFragmentBegin(fragment)
    The document fragment calls this before walking its child nodes.
--- XML::Visitor#visitDocumentFragment(fragment)
    The document fragment calls this after walking its child nodes.
--- XML::Visitor#visitElementBegin(element)
    The element calls this before walking its child nodes.
--- XML::Visitor#visitElement(element)
    The element calls this after walking its child nodes.
--- XML::Visitor#visitTag(tag)
--- XML::Visitor#visitEndtag(tag)
--- XML::Visitor#visitText(text)
--- XML::Visitor#visitComment(comment)
--- XML::Visitor#visitDocumentType(doctype)
--- XML::Visitor#visitCDATASection(cdata)
--- XML::Visitor#visitProcessingInstruction(instruction)
    The node calls the matching method each time it is visited.

Typically, document.accept(visitor) produces the calling sequence

  (1) visitor.visitDocumentBegin(document)
  (2) visitor.visitProcessingInstruction(instruction)
  (3) visitor.visitDocumentType(doctype)
  (4) visitor.visitElementBegin(html)
  (5) visitor.visitTag(html)
  (6) visitor.visitElementBegin(head)
  (7) visitor.visitTag(head)
  (8) visitor.visitElementBegin(title)
  (9) visitor.visitTag(title)
  (10) visitor.visitText(text)
  (11) visitor.visitEndtag(title)
  (12) visitor.visitElement(title)
  (13) visitor.visitElementBegin(link)
  (14) visitor.visitTag(link)
  (15) visitor.visitElement(link)
  (16) visitor.visitEndtag(head)
  (17) visitor.visitElement(head)
  (18) ...omitted
  (19) visitor.visitEndtag(html)
  (20) visitor.visitElement(html)
  (21) visitor.visitDocument(document)

=end

  class Visitor
    def visit(node) end
    def visitDocumentBegin(document) end
    def visitDocument(document) end
    def visitDocumentFragmentBegin(fragment) end
    def visitDocumentFragment(fragment) end
    def visitElementBegin(element) end
    def visitElement(element) end
    def visitTag(tag) end
    def visitEndtag(endtag) end
    def visitText(text) end
    def visitComment(comment) end
    def visitDocumentType(doctype) end
    def visitCDATASection(cdata) end
    def visitProcessingInstruction(instruction) end
  end

=begin
== XML Scanner
--- XML.scan(s) { |node| process node }
    Line-oriented scanner for the XML parser. It slices the entire XML
    manuscript into suitable pieces and wraps each piece in the
    corresponding object. As soon as an object has been made, it is yielded.
=end

  def XML.scan(s)
    buffer = s
    indent = ""
    until buffer.empty?
      kind = case buffer
        when /\A<\?.+?>\s*/m then PROCESSING_INSTRUCTION_NODE
        when /\A<!DOCTYPE[^>\]]+\[[^\]]+\]>\s*/m then DOCUMENT_TYPE_NODE
        when /\A<!DOCTYPE[^>]+?>\s*/m then DOCUMENT_TYPE_NODE
        when /\A<!\[CDATA\[.*?\]\]>\s*/m then CDATA_SECTION_NODE
        when /\A<!--.*?-->\s*/m then COMMENT_NODE
        when /\A<\/.+?>\s*/m then ENDTAG_NODE
        when /\A<\w.*?>\s*/m then TAG_NODE
        when /[^<]+/m then TEXT_NODE
        end
      buffer = $'
      pickup = $&
      # Trailing blanks become the indent of the next piece.
      if /[ \t]+\z/m =~ pickup
        pickup = $`
        nextIndent = $&
      else
        nextIndent = ""
      end
      pickup = indent+pickup
      indent = nextIndent
      yield case kind
        when PROCESSING_INSTRUCTION_NODE then ProcessingInstruction.new(pickup)
        when DOCUMENT_TYPE_NODE then DocumentType.new(pickup)
        when TAG_NODE then Element.new(pickup)
        when ENDTAG_NODE then Endtag.new(pickup)
        when TEXT_NODE then Text.new(pickup)
        when CDATA_SECTION_NODE then CDATASection.new(pickup)
        when COMMENT_NODE then Comment.new(pickup)
        end
    end
  end

=begin
=== XML Parser
--- XML.parse(s, mode = XMLmode) { |node| ... }
    Builds up a document object from the entire XML manuscript string.
    The function may yield each node to a block, like the SAX interface.
    If mode is HTMLmode, the parser recognizes HTML empty elements by name
    automatically, without the XML empty-element notation. By default the
    parser works in XMLmode, in which it recognizes an empty element only
    by the XML notation. Non-empty elements must be closed by an endtag;
    otherwise XML::ParseError is raised.

      document = XML.parse(aFile.read) { |node| ... }

    NOTE: This version cannot handle conditional inclusions at any position.
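
The way the parser nests elements can be illustrated with a context stack.
The following is a simplified stand-alone sketch, not the library's code:
build_tree, name_of and MismatchError are hypothetical names, plain Arrays
and Strings stand in for the node classes, and only tags and text are
handled (no comments, DOCTYPE, CDATA or HTML mode).

```ruby
# Hypothetical stand-in for XML::ParseError.
class MismatchError < RuntimeError; end

# Lower-cased tag name of a start-tag or endtag string.
def name_of(tag)
  tag[/\A<\/?([:\w]+)/, 1].to_s.downcase
end

def build_tree(source)
  document = []
  context = [document]                 # the innermost open container is last
  source.scan(/<[^>]+>|[^<]+/) do |piece|
    case piece
    when %r{\A</}                      # endtag: must close the innermost element
      element = context[-1]
      unless context.size > 1 && name_of(element[0]) == name_of(piece)
        raise MismatchError, "Mismatch Tag and Endtag: #{piece}"
      end
      element.push(piece)
      context.pop
    when %r{\A<\w.*/>\z}m              # XML empty element: [tag] only
      context[-1].push([piece])
    when /\A<\w/                       # start tag: open a new element
      element = [piece]
      context[-1].push(element)
      context.push(element)
    else                               # text
      context[-1].push(piece)
    end
  end
  document
end

tree = build_tree('<p>Hello <br/>world</p>')
```

Each element ends up as [tag, child nodes ..., endtag], the same shape that
XML::Element is documented to have, and an endtag that does not match the
innermost open element raises the error.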
=end

  class ParseError < RuntimeError; end
  class NotSupport < ParseError; end

  def XML.parse(s, mode = XMLmode)
    if /(\]]+\[).*\]\]>/m =~ s
      raise NotSupport,
        'This version cannot treat conditional inclusion "'+$1+' .. ]]>"'
    end
    document = Document.new
    context = [document]
    scan(s) do |node|
      #$stderr.puts node.nodeName+":"+node.inspect.tosjis
      case node
      when DocumentType, Comment, Text, CDATASection, ProcessingInstruction
        context[-1].push node
      when Element
        context[-1].push node
        unless node.emptyElement?(mode)
          context.push(node)
          node = node[0]
        end
      when Endtag
        if context[-1].element?(node.nodeName)
          context[-1].push node
          node = context.pop
        else
          raise ParseError, "Mismatch Tag and Endtag :"+node.to_s
        end
      end
      yield node if block_given?
    end
    yield document if block_given?
    document
  end

=begin
--- XML.parsehtml(s) { |node| ... }
    Builds up a document object from the entire HTML manuscript string.
    The function may yield each node to a block, like the SAX interface.
    It recognizes empty elements by the tag names in W3C's HTML 4.01
    strict DTD.

      document = XML.parsehtml(aFile.read) { |node| ... }
=end

  def XML.parsehtml(s)
    if block_given?
      parse(s, HTMLmode) { |n| yield n }
    else
      parse(s, HTMLmode)
    end
  end

end
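=begin
The two traversal styles documented for XML::Visitor, accept with double
dispatch and walkin with a single visit method, can be compared in a
stand-alone sketch. It does not use the library: MiniNode, MiniText,
MiniElement, MiniVisitor and TextCollector are hypothetical miniature
classes that only imitate the documented protocol.

```ruby
module MiniNode
  # walkin: children first, then every node calls the same method.
  def walkin(visitor)
    each { |n| n.walkin(visitor) if n.is_a?(MiniNode) } if is_a?(Array)
    visitor.visit(self)
  end
end

class MiniText < String
  include MiniNode
  # accept: the node picks the visitor method that matches its class.
  def accept(visitor)
    visitor.visitText(self)
  end
end

class MiniElement < Array
  include MiniNode
  def accept(visitor)
    visitor.visitElementBegin(self)
    each { |n| n.accept(visitor) if n.is_a?(MiniNode) }
    visitor.visitElement(self)
  end
end

# Base visitor: every method does nothing by default.
class MiniVisitor
  def visit(node) end
  def visitElementBegin(element) end
  def visitElement(element) end
  def visitText(text) end
end

class TextCollector < MiniVisitor
  attr_reader :texts, :visited
  def initialize
    @texts = []
    @visited = 0
  end
  def visitText(text)
    @texts << text.to_s
  end
  def visit(node)
    @visited += 1
  end
end

inner = MiniElement.new
inner.push(MiniText.new("world"))
tree = MiniElement.new
tree.push(MiniText.new("Hello, "))
tree.push(inner)

collector = TextCollector.new
tree.accept(collector)   # fills collector.texts through visitText
tree.walkin(collector)   # counts every node through visit
```

With accept, only the overridden visitText collects anything; with walkin,
all four nodes of the small tree reach the single visit method.
=end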