Convert HTML to XHTML and Clean Unnecessary Tags and Attributes

Reader submission 702 2022-08-28

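Introduction

This is a class library that helps you produce valid XHTML from HTML. It also supports tag and attribute filtering: you specify which tags and attributes are allowed in the output, and all other tags are filtered out. You can use it to clean up the bloated HTML that Microsoft Word generates when a document is converted to HTML. You can also use it to clean HTML before posting it to a blog site, since blog engines such as WordPress and b2evolution may otherwise reject it.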

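How it works

There are two classes: HtmlReader and HtmlWriter.

HtmlReader extends the well-known SgmlReader written by Chris Lovett. While reading HTML, it skips every node whose name carries a prefix. As a result, tags such as <o:p>, <o:Document> and <st1:personname>, along with hundreds of other useless tags, are filtered out, and the HTML you read contains only core HTML tags.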

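HtmlWriter extends the regular XmlWriter (through XmlTextWriter, as its class declaration shows), which produces XML. XHTML is essentially HTML in XML format. Familiar tags that have no closing tag in HTML, such as <img>, <br> and <hr>, must be written as empty elements in XHTML: <img ... />, <br /> and <hr />.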
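To make the empty-element point concrete, here is a minimal sketch that uses only the standard XmlTextWriter. The class name XhtmlEmptyElementDemo and the sample content are made up for illustration and are not part of the library described here.

```csharp
using System.IO;
using System.Xml;

// Hypothetical demo class, not part of the article's library.
public static class XhtmlEmptyElementDemo
{
    // Writes a paragraph containing <br> and <img> and returns the markup.
    public static string Build()
    {
        var output = new StringWriter();
        var writer = new XmlTextWriter(output);

        writer.WriteStartElement("p");
        writer.WriteString("line one");

        // An element that is closed without content is emitted
        // in the self-closing form that XHTML requires.
        writer.WriteStartElement("br");
        writer.WriteEndElement();

        writer.WriteString("line two");

        writer.WriteStartElement("img");
        writer.WriteAttributeString("src", "a.png");
        writer.WriteEndElement();

        writer.WriteEndElement(); // closes <p>
        writer.Flush();
        return output.ToString();
    }
}
```

Because the result is well-formed XML, it can be loaded back with any XML parser, for example XmlDocument.LoadXml.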

HtmlReader

HtmlReader is quite simple. Here is the complete class:

using System.IO;
using System.Xml;

/// <summary>
/// This class skips all nodes which have some
/// kind of prefix. This trick does the job
/// of cleaning up MS Word/Outlook HTML markup.
/// </summary>
public class HtmlReader : Sgml.SgmlReader
{
    public HtmlReader( TextReader reader ) : base( )
    {
        base.InputStream = reader;
        base.DocType = "HTML";
    }

    public HtmlReader( string content ) : base( )
    {
        base.InputStream = new StringReader( content );
        base.DocType = "HTML";
    }

    public override bool Read()
    {
        bool status = base.Read();
        if( status )
        {
            if( base.NodeType == XmlNodeType.Element )
            {
                // Got a node with prefix. This must be one
                // of those "<o:p>" or something else.
                // Skip this node entirely. We want prefix
                // less nodes so that the resultant XML
                // requires no namespace.
                if( base.Name.IndexOf(':') > 0 )
                    base.Skip();
            }
        }
        return status;
    }
}
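The prefix-skipping idea can be tried without SgmlReader. The sketch below is an illustration only: it uses the standard XmlReader on well-formed input, so the prefix has to be declared, and the class name PrefixSkipDemo and the namespace URI are invented for the example.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml;

// Hypothetical demo class, not part of the article's library.
public static class PrefixSkipDemo
{
    // Returns the names of the elements that remain once every
    // prefixed element (and its whole subtree) has been skipped.
    public static List<string> KeptElementNames(string xml)
    {
        var kept = new List<string>();
        using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
        {
            bool more = reader.Read();
            while (more)
            {
                if (reader.NodeType == XmlNodeType.Element &&
                    reader.Name.IndexOf(':') > 0)
                {
                    // Skip() consumes the element and its children and
                    // leaves the reader on the node that follows, so that
                    // node is inspected before Read() is called again.
                    reader.Skip();
                    more = !reader.EOF;
                    continue;
                }

                if (reader.NodeType == XmlNodeType.Element)
                    kept.Add(reader.Name);

                more = reader.Read();
            }
        }
        return kept;
    }
}
```

One detail worth noticing: after Skip() the reader is already positioned on the next node, which may itself be a prefixed element. The loop above rechecks that node, whereas a single check inside Read() does not.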

HtmlWriter

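This class is a bit trickier. Here are the tricks it uses:

Override WriteString to avoid the regular XML encoding, and do the encoding for HTML manually.

Override WriteStartElement so that disallowed tags are not written to the output.

Override WriteAttributes so that unwanted attributes are not written.

Let's go through the class part by part:

Configurability

You can configure HtmlWriter by changing the following members: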

public class HtmlWriter : XmlTextWriter
{
    // Constructors are omitted from this excerpt; the overrides shown
    // in the following sections belong inside this class.

    /// <summary>
    /// If set to true, it will filter the output
    /// by using tag and attribute filtering,
    /// space reduce etc
    /// </summary>
    public bool FilterOutput = false;

    /// <summary>
    /// If true, it will reduce consecutive spaces to one instance
    /// </summary>
    public bool ReduceConsecutiveSpace = true;

    /// <summary>
    /// Set the tag names in lower case which are allowed to go to output
    /// </summary>
    public string [] AllowedTags =
        new string[] { "p", "b", "i", "u", "em", "big", "small",
        "div", "img", "span", "blockquote", "code", "pre", "br", "hr",
        "ul", "ol", "li", "del", "ins", "strong", "a", "font", "dd", "dt"};

    /// <summary>
    /// If any tag found which is not allowed, it is replaced by this tag.
    /// Specify a tag which has least impact on output
    /// </summary>
    public string ReplacementTag = "dd";

    /// <summary>
    /// New lines \r\n are replaced with space
    /// which saves space and makes the
    /// output compact
    /// </summary>
    public bool RemoveNewlines = true;

    /// <summary>
    /// Specify which attributes are allowed.
    /// Any other attribute will be discarded
    /// </summary>
    public string [] AllowedAttributes = new string[]
    {
        "class", "href", "target", "border", "src",
        "align", "width", "height", "color", "size"
    };
}

The WriteString method

/// <summary>
/// The reason why we are overriding
/// this method is, we do not want the output to be
/// encoded for texts inside attribute
/// and inside node elements. For example, all the &nbsp;
/// gets converted to &amp;nbsp; in output. But this does not
/// apply to HTML. In HTML, we need to have &nbsp; as it is.
/// </summary>
/// <param name="text">Text to write</param>
public override void WriteString(string text)
{
    // Change all non-breaking space to normal space
    text = text.Replace( "\u00A0", " " );

    // When you are reading RSS feed and writing Html,
    // these lines help remove those CDATA tags
    text = text.Replace( "<![CDATA[", "" );
    text = text.Replace( "]]>", "" );

    // Do some encoding of our own because
    // we are going to use WriteRaw which won't
    // do any of the necessary encoding.
    // "&" is left alone on purpose so that entities
    // such as &nbsp; pass through unchanged.
    text = text.Replace( "<", "&lt;" );
    text = text.Replace( ">", "&gt;" );
    text = text.Replace( "'", "&apos;" );
    text = text.Replace( "\"", "&quot;" );

    if( this.FilterOutput )
    {
        text = text.Trim();

        // We want to replace consecutive spaces
        // to one space in order to save horizontal width
        if( this.ReduceConsecutiveSpace )
        {
            while( text.Contains("  ") )
                text = text.Replace("  ", " ");
        }

        if( this.RemoveNewlines )
            text = text.Replace(Environment.NewLine, " ");
    }

    base.WriteRaw( text );
}
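The manual encoding step can be tested in isolation. The following is a sketch under assumptions rather than the library's code: HtmlTextEncoder, Encode and CollapseSpaces are hypothetical names, and the replacement list mirrors the one described above, including the choice to leave "&" untouched.

```csharp
// Hypothetical helper, not part of the article's library.
public static class HtmlTextEncoder
{
    // Encodes markup characters by hand, as needed before WriteRaw.
    public static string Encode(string text)
    {
        // Strip CDATA markers first: "]]>" contains '>' and would
        // no longer match once '>' has been encoded.
        text = text.Replace("<![CDATA[", "").Replace("]]>", "");

        // "&" is deliberately not encoded, so an existing entity such
        // as &nbsp; survives. The trade-off is that a bare "&" in the
        // input also survives and makes the output invalid XML.
        text = text.Replace("<", "&lt;");
        text = text.Replace(">", "&gt;");
        text = text.Replace("'", "&apos;");
        text = text.Replace("\"", "&quot;");
        return text;
    }

    // Collapses any run of spaces into a single space.
    public static string CollapseSpaces(string text)
    {
        // A single Replace only halves a run, so repeat until stable.
        while (text.Contains("  "))
            text = text.Replace("  ", " ");
        return text;
    }
}
```

Keeping the encoding in a pure function like this makes the order of the replacements easy to verify before it is wired into a writer.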

WriteStartElement: applying tag filtering

public override void WriteStartElement(string prefix,
    string localName, string ns)
{
    if( this.FilterOutput )
    {
        bool canWrite = false;
        string tagLocalName = localName.ToLower();
        foreach( string name in this.AllowedTags )
        {
            if( name == tagLocalName )
            {
                canWrite = true;
                break;
            }
        }

        // A disallowed tag is replaced rather than dropped, so the
        // matching WriteEndElement call still has an element to close.
        if( !canWrite )
            localName = this.ReplacementTag;
    }

    base.WriteStartElement(prefix, localName, ns);
}
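The tag filter can be exercised on its own with a small writer. This is a minimal sketch, assuming a HashSet lookup in place of the array scan; TagFilterWriter and TagFilterDemo are hypothetical names and the allowed list is shortened for the example.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

// Hypothetical writer that only demonstrates tag filtering.
public class TagFilterWriter : XmlTextWriter
{
    private readonly HashSet<string> allowedTags;
    private readonly string replacementTag;

    public TagFilterWriter(TextWriter output,
        IEnumerable<string> allowedTags, string replacementTag)
        : base(output)
    {
        // Case-insensitive set, so "DIV" and "div" are treated alike.
        this.allowedTags = new HashSet<string>(
            allowedTags, StringComparer.OrdinalIgnoreCase);
        this.replacementTag = replacementTag;
    }

    public override void WriteStartElement(
        string prefix, string localName, string ns)
    {
        // Allowed tags are written in lower case, as XHTML expects;
        // anything else is swapped for the replacement tag.
        string name = allowedTags.Contains(localName)
            ? localName.ToLowerInvariant()
            : replacementTag;
        base.WriteStartElement(prefix, name, ns);
    }
}

public static class TagFilterDemo
{
    public static string Run()
    {
        var output = new StringWriter();
        var writer = new TagFilterWriter(
            output, new[] { "div", "p", "dd" }, "dd");

        writer.WriteStartElement("div");
        writer.WriteStartElement("MARQUEE"); // not allowed
        writer.WriteString("x");
        writer.WriteEndElement();
        writer.WriteEndElement();
        writer.Flush();
        return output.ToString();
    }
}
```

The one-argument WriteStartElement overload forwards to the three-argument method overridden here, which is why the filter applies to every way of opening an element.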

The WriteAttributes method: applying attribute filtering

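// Excerpt from the WriteAttributes(XmlReader reader, bool defattr)
// override: this runs once per attribute, with reader positioned
// on that attribute.

// Check whether the attribute is allowed
bool canWrite = false;
string attributeLocalName = reader.LocalName.ToLower();
foreach( string name in this.AllowedAttributes )
{
    if( name == attributeLocalName )
    {
        canWrite = true;
        break;
    }
}

// If allowed, write the attribute
if( canWrite )
    this.WriteStartAttribute(reader.Prefix,
        attributeLocalName, reader.NamespaceURI);

// The attribute value is always consumed, but only written
// when the attribute is allowed
while (reader.ReadAttributeValue())
{
    if (reader.NodeType == XmlNodeType.EntityReference)
    {
        if( canWrite ) this.WriteEntityRef(reader.Name);
        continue;
    }
    if( canWrite ) this.WriteString(reader.Value);
}

if( canWrite ) this.WriteEndAttribute();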
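For a complete, runnable view of attribute filtering, here is a sketch that takes a different route from the excerpt above: instead of overriding WriteAttributes, it copies nodes from a standard XmlReader to an XmlTextWriter by hand and drops attributes that are not on the allow-list. AttributeFilterDemo and Filter are hypothetical names, and only elements, text and end tags are handled.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml;

// Hypothetical demo class, not part of the article's library.
public static class AttributeFilterDemo
{
    // Copies well-formed markup, keeping only allowed attributes.
    // allowedAttributes is expected to hold lower-case names.
    public static string Filter(string xml,
        ICollection<string> allowedAttributes)
    {
        var output = new StringWriter();
        var writer = new XmlTextWriter(output);

        using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
        {
            while (reader.Read())
            {
                switch (reader.NodeType)
                {
                    case XmlNodeType.Element:
                        writer.WriteStartElement(reader.LocalName);
                        if (reader.MoveToFirstAttribute())
                        {
                            do
                            {
                                string name =
                                    reader.LocalName.ToLowerInvariant();
                                if (allowedAttributes.Contains(name))
                                    writer.WriteAttributeString(
                                        name, reader.Value);
                            } while (reader.MoveToNextAttribute());

                            // Return to the element before checking
                            // whether it is an empty element.
                            reader.MoveToElement();
                        }
                        if (reader.IsEmptyElement)
                            writer.WriteEndElement();
                        break;

                    case XmlNodeType.Text:
                        writer.WriteString(reader.Value);
                        break;

                    case XmlNodeType.EndElement:
                        writer.WriteEndElement();
                        break;
                }
            }
        }

        writer.Flush();
        return output.ToString();
    }
}
```

A call such as AttributeFilterDemo.Filter(markup, new HashSet<string> { "class", "href" }) keeps class and href and discards inline styles and event handlers.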

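Conclusion

The sample application is a utility you can use right away to clean up HTML files. You can also use these classes in tools, such as blogging clients, that need to post HTML to a web service.

Original article: http://codeproject.com/Articles/10792/Convert-HTML-to-XHTML-and-Clean-Unnecessary-Tags-a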
