PHP, XML, and Character Encodings [repost]

2007-03-02 11:37:44 / Category: LAMP

http://minutillo.com/steve/weblog/2004/6/17/php-xml-and-character-encodings-a-tale-of-sadness-rage-and-data-loss

PHP, XML, and Character Encodings: a tale of sadness, rage, and (data-)loss

Geekism

Update: This code has been finalized and debugged, and is now shipped as part of MagpieRSS 0.7! Sadness and rage no more!

So I have this little program, called Feed on Feeds. It’s an RSS and Atom aggregator. For a long time I’ve known that it doesn’t quite handle international characters that well, so I set out to fix it. I knew that somewhere between input feed and output HTML page, characters were getting messed up. I adopted a policy of “UTF-8 Everywhere”: since FoF has to deal with feeds in lots of different charsets, but display them all on one page, I’d translate everything into UTF-8. I UTF-ized everything in the display code, and made sure that the DB wasn’t mucking with the characters, finally closing in on the place where it seemed characters were being munged: the XML parser itself, called by MagpieRSS, the RSS and Atom parser used by FoF.

Here’s how Magpie was creating the XML parser:

$parser = xml_parser_create();

Nice! Simple! But it munges characters, especially numeric entities. After reading some PHP docs, I found that there are two things you can set in PHP’s XML parser: the source encoding, and the target encoding. You can set the target encoding this way:

$parser = xml_parser_create();
xml_parser_set_option($parser, XML_OPTION_TARGET_ENCODING, "UTF-8");

This means: “Whatever charset the XML is in, I want you to translate it into UTF-8. And if you happen to find any numeric entities in there, resolve them into UTF-8 characters, too.”
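As a rough illustration of what the target-encoding option does to numeric entities, here is a minimal sketch. It is not Magpie’s code: the helper name collect_text and the sample document are made up for this example, and the demo is skipped when PHP’s XML extension is not installed.

```php
<?php
// Hypothetical helper: parse $xml and return all character data,
// delivered in the parser's target encoding (UTF-8 here).
function collect_text(string $xml): string
{
    $text = '';
    $parser = xml_parser_create();
    xml_parser_set_option($parser, XML_OPTION_TARGET_ENCODING, "UTF-8");
    xml_set_character_data_handler(
        $parser,
        function ($p, $data) use (&$text) {
            // The handler may be called several times; append each chunk.
            $text .= $data;
        }
    );
    xml_parse($parser, $xml, true);
    return $text;
}

// Made-up sample: "caf&#233;" uses a numeric entity for the accented e.
if (function_exists('xml_parser_create')) {
    $out = collect_text('<?xml version="1.0" encoding="UTF-8"?><t>caf&#233;</t>');
    echo bin2hex($out), "\n"; // hex dump of the bytes handed to the handler
}
```

Dumping the result as hex makes it easy to see whether the entity arrived as a multi-byte UTF-8 sequence or as something else.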

So I tried that. But it still wasn’t working. Some feeds were translated into UTF-8 properly, but others weren’t. Feeds already in UTF-8 were re-encoded, resulting in gibberish. Reading some more documentation and bug reports, I found that if you don’t set the source encoding, PHP assumes your XML is in ISO-8859-1! I was amazed that PHP’s XML parser didn’t examine the XML prolog to determine the encoding, and further shocked that they chose such an insane default. But anyway. You can set both source and target encodings this way:

$parser = xml_parser_create("EBCIDIC");
xml_parser_set_option($parser, XML_OPTION_TARGET_ENCODING, "UTF-8");

This means, “I’m about to give you some XML, in EBCIDIC. I want you to translate all those characters into UTF-8 while you’re parsing it. Don’t forget to turn any numeric entities you find into UTF-8, too.”

That works… but presents a problem. How do you know the charset the XML is in? The only answer I could come up with: scan the XML myself, and find the encoding!

$rx = '/<\?xml.*encoding=[\'"](.*?)[\'"].*\?>/m';

if (preg_match($rx, $xml, $m)) {
  $encoding = strtoupper($m[1]);
} else {
  $encoding = "UTF-8";
}

That regex finds the charset declaration in the XML prolog itself, and if found, saves it in the variable $encoding. If it wasn’t found, it assumes the XML is in UTF-8 already, which is the default for XML.
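The prolog scan can be wrapped in a small function to make it easier to try against sample input. This is only a sketch: the name detect_xml_encoding is hypothetical, and the regex escapes the quote characters and the literal question marks so that it is valid inside a single-quoted PHP string.

```php
<?php
// Hypothetical helper wrapping the regex approach: returns the declared
// encoding in upper case, or "UTF-8" when no declaration is found.
function detect_xml_encoding(string $xml): string
{
    $rx = '/<\?xml.*encoding=[\'"](.*?)[\'"].*\?>/m';
    if (preg_match($rx, $xml, $m)) {
        return strtoupper($m[1]);
    }
    return "UTF-8"; // XML's default when nothing is declared
}

echo detect_xml_encoding('<?xml version="1.0" encoding="iso-8859-1"?><rss/>'), "\n";
echo detect_xml_encoding('<rss version="2.0"/>'), "\n";
```

The first call reports the declared charset; the second has no prolog at all, so it falls back to the default.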

So the full code is now:

$rx = '/<\?xml.*encoding=[\'"](.*?)[\'"].*\?>/m';

if (preg_match($rx, $xml, $m)) {
  $encoding = strtoupper($m[1]);
} else {
  $encoding = "UTF-8";
}

$parser = xml_parser_create($encoding);
xml_parser_set_option($parser, XML_OPTION_TARGET_ENCODING, "UTF-8");

That, finally, worked. All my feeds were reliably translated into UTF-8. But that was just by coincidence: all the feeds I subscribe to were already in UTF-8 or ISO-8859-1. After making this release, people complained that feeds in ISO-8859-15 and BIG-5 weren’t working. Consulting the PHP docs again, and double-checking in the source code because it was just so surprising, I found that PHP 4.x only supports UTF-8, ISO-8859-1, and US-ASCII. So anybody out there who wants to subscribe to a feed in ISO-8859-anything-but-1 or BIG5 or SHIFT-JIS is still screwed.

Even PHP 5 won’t help here, when it is released: it sort-of supports a longer list of encodings, but not BIG5 or GB2312, the two main Chinese encodings.

So I searched the PHP docs some more, and came up with a potential solution: mbstring! The mbstring family of functions supports a huge long list of encodings, and can translate between them. So here’s the final solution: use a regex to find the source encoding. If PHP can handle it natively, fine. If not, lop off the XML prolog, replace it with one that says encoding="utf-8" and pass the whole XML file through mb_convert_encoding to convert it to UTF-8 before the parser even sees it. If mb_convert_encoding blows up (which it will if the source encoding is not recognized, or if the function completely doesn’t exist, which I’m told is highly probable, since it is an optional extension) just give up and pass the XML straight to the parser and avert your eyes as it makes mincemeat of the characters. At least I tried.

$rx = '/<\?xml.*encoding=[\'"](.*?)[\'"].*\?>/m';

if (preg_match($rx, $source, $m)) {
  $encoding = strtoupper($m[1]);
} else {
  $encoding = "UTF-8";
}

if($encoding == "UTF-8" || $encoding == "US-ASCII" || $encoding == "ISO-8859-1") {
  $parser = xml_parser_create($encoding);
} else {
  // Stays NULL if mbstring is missing or the conversion fails.
  $encoded_source = NULL;

  if(function_exists('mb_convert_encoding')) {
    $encoded_source = @mb_convert_encoding($source, "UTF-8", $encoding);
  }

  if($encoded_source != NULL) {
    $source = str_replace ( $m[0],'<?xml version="1.0" encoding="utf-8"?>', $encoded_source);
  }

  $parser = xml_parser_create("UTF-8");
}

xml_parser_set_option($parser, XML_OPTION_TARGET_ENCODING, "UTF-8");

Surprisingly, this hack on top of a hack wrapped up in a hack with extra hack on top… worked! It was able to parse ISO-8859-15, BIG-5, even GB2312 feeds just fine, and translate them all into UTF-8 for display on a single page. I have these changes in my local copy of FoF now, and I’m going to let them burn in for a few days before I release them to the wider world, who will probably point out, within minutes, the multiple and tragic ways that even this solution fails. But until then, I proclaim that this is the state of the art in PHP XML charset-aware parsing. I think this is as good as it gets in PHP 4.x.
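The detect-then-convert steps above can also be packaged as a function that is testable without creating a parser. This is a sketch under assumptions, not the code that shipped in FoF or Magpie: prepare_utf8_source is a hypothetical name, and the try/catch is there only because newer PHP versions throw on unknown encoding names instead of returning false. The second element of the returned pair is what you would pass to xml_parser_create().

```php
<?php
// Sketch only: prepare_utf8_source() is a hypothetical name. It returns the
// (possibly converted) XML source plus the encoding to hand to the parser.
function prepare_utf8_source(string $source): array
{
    $rx = '/<\?xml.*encoding=[\'"](.*?)[\'"].*\?>/m';
    $encoding = preg_match($rx, $source, $m) ? strtoupper($m[1]) : "UTF-8";

    // Encodings the parser accepts directly.
    if (in_array($encoding, ["UTF-8", "US-ASCII", "ISO-8859-1"], true)) {
        return [$source, $encoding];
    }

    $converted = null;
    if (function_exists('mb_convert_encoding')) {
        try {
            // Newer PHP versions throw for unknown encodings;
            // older ones return false and emit a warning.
            $converted = @mb_convert_encoding($source, "UTF-8", $encoding);
        } catch (\Throwable $e) {
            $converted = null;
        }
    }

    if (is_string($converted) && $converted !== '') {
        // Swap the old prolog for one that matches the converted bytes.
        $source = str_replace(
            $m[0],
            '<?xml version="1.0" encoding="utf-8"?>',
            $converted
        );
    }
    // If conversion failed, the source goes through unchanged, as in the post.
    return [$source, "UTF-8"];
}

// Made-up ISO-8859-15 sample: byte 0xA4 is the euro sign in that charset.
$xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-15\"?><t>\xA4</t>";
[$utf8, $enc] = prepare_utf8_source($xml);
echo $enc, "\n";
```

Returning the pair instead of a parser keeps the sketch independent of the XML extension, so only the mbstring branch varies from host to host.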

Footnote: when I say PHP5 sort-of supports more encodings, this is what I mean: PHP5 (I looked at RC3, maybe these bugs will be fixed by the final release) is completely nuts. The XML parser supports a bunch more encodings, but they are really hard to get to. If you try to explicitly set the input encoding, the PHP code limits you to UTF-8, ISO-8859-1, or US-ASCII, even though libxml2, the underlying parser, supports many more. But, if you know the super secret codes, you can construct the parser this way:

$parser = xml_parser_create("");

Notice the difference? In PHP5 bizarro world, passing in an empty string means “do what you should have done all along, auto detect the stupid encoding!” But, there’s another problem: if you auto-detect the stupid encoding this way, the stupid target encoding is stupidly set to ISO-8859-1. I don’t know who would want that. And it goes against the documentation, which says by default the target encoding is set to the source encoding. And again, you are restricted artificially from setting the output encoding to anything other than UTF-8, ISO-8859-1, or US-ASCII. So you could, if you want, use a regex (yuck!) to find the source encoding, but you wouldn’t be allowed to set the target encoding to match. But, at least, you can do this:

$parser = xml_parser_create("");
xml_parser_set_option($parser, XML_OPTION_TARGET_ENCODING, "UTF-8");

Meaning, “Auto detect the source encoding, and then translate everything, including numeric entities, into UTF-8.”
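For anyone who wants to check the auto-detect behaviour on their own install, a small sketch follows. The helper name parse_autodetect and the sample document are assumptions made for this example, and the demo only runs when the XML extension is available.

```php
<?php
// Hypothetical helper: parse $xml with source auto-detection ("" as the
// encoding argument) and return the character data as UTF-8.
function parse_autodetect(string $xml): string
{
    $text = '';
    $parser = xml_parser_create("");
    xml_parser_set_option($parser, XML_OPTION_TARGET_ENCODING, "UTF-8");
    xml_set_character_data_handler(
        $parser,
        function ($p, $data) use (&$text) {
            $text .= $data;
        }
    );
    xml_parse($parser, $xml, true);
    return $text;
}

// Made-up ISO-8859-1 document: byte 0xE9 is the accented e in that charset.
$latin1 = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><t>caf\xE9</t>";
if (function_exists('xml_parser_create')) {
    echo bin2hex(parse_autodetect($latin1)), "\n";
}
```

If the single byte 0xE9 comes out as a two-byte sequence in the hex dump, the parser read the prolog and transcoded the content.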

At least you will be able to… when PHP5 comes out, and is installed on the server where your application needs to run, which for me (I get complaints that FoF won’t work on PHP 3) probably won’t be for several years.

64 Feedbacks on "PHP, XML, and Character Encodings: a tale of sadness, rage, and (data-)loss"

Phil Ringnalda

If you really want to cover all the bases, don’t just look for mb_convert_encoding(): you can also look for (and rarely find) iconv() and recode(). Then, on *nix, try to shell_exec(’iconv…’), though you’ll fail in safe mode and most shared hosts disable shell_exec. But, there’s still one last hope! Most of them don’t realize that they should also disable the strange and terrible proc_open(), so you can actually fork a process to run iconv, and open input and output pipes to feed and read.

Or, sigh, write ten lines of Python to call Mark’s Universal Feed Parser and return the output as a PHP include. Sometimes, PHP really ticks me off.
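The first two rungs of that fallback ladder might look like the sketch below. The function name to_utf8 is hypothetical, and the shell_exec and proc_open routes from the comment are deliberately left out because they depend on the host.

```php
<?php
// Hypothetical helper: try each available converter in turn and return the
// UTF-8 string, or null when nothing could convert it.
function to_utf8(string $data, string $from): ?string
{
    if (function_exists('mb_convert_encoding')) {
        try {
            $out = @mb_convert_encoding($data, 'UTF-8', $from);
            if (is_string($out)) {
                return $out;
            }
        } catch (\Throwable $e) {
            // mbstring does not know this encoding name; fall through to iconv.
        }
    }
    if (function_exists('iconv')) {
        $out = @iconv($from, 'UTF-8', $data);
        if ($out !== false) {
            return $out;
        }
    }
    return null;
}

$latin1 = "caf\xE9"; // the word "cafe" with an accented e, in ISO-8859-1
$utf8 = to_utf8($latin1, 'ISO-8859-1');
echo $utf8 === null ? "no converter available\n" : bin2hex($utf8) . "\n";
```

Returning null instead of the raw input lets the caller decide whether to give up or hand the unconverted bytes to the parser anyway.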

steve

OOooooo… I didn’t know about those!

Or I could just create a web service: http://xml-transcoder.com/?to=utf-8&url=http://nasty.feed/in/EBCIDIC/or/some/such. Then you’d just seamlessly subscribe to the transcoded version of the feed.

isis

good job and funny title.

steve

Thanks, isis. I used your feed (since it’s big5) as one of the test cases!

zonble

Steve, may I translate your post into Chinese and share it with Chinese readers? I think many people in Asia would like to know how to handle international characters when programming in PHP.

steve

Yes! You certainly can! If you can wait until Monday, you can post your translation of this article along with a pointer to Feed on Feeds 0.1.7 which will contain the working code.

Mark Wu

Hi Steve:

I plan to integrate FoF into pLog. This tale really helped me learn a lot about PHP’s stupid encoding…

Regards, mark

Mark Wu

Only one thing: my ISP does not support iconv… my god… What can I do?

steve

Uh oh… how about ‘mbstring’? If you don’t have iconv or mbstring, then you will only be able to work with feeds that are in UTF-8, ISO-8859-1, or ASCII.

Thomas Clavier

For me, it isn’t a good idea to look for the charset only in the XML header. It’s better to look in the HTTP header first because, according to the W3C, you can fall back to the XML charset only if the charset isn’t specified in the HTTP header.

http://www.w3.org/TR/WD-html40-970708/charset.html

steve

To really get it right you have to check both. In practice, I haven’t yet found any feeds where checking the headers is necessary or even helpful.
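One possible way to check both sources is sketched below, giving the HTTP header precedence as Thomas suggests. The function name pick_charset and the header values are made up for illustration; how the Content-Type header is obtained is left to the fetching code.

```php
<?php
// Hypothetical helper: prefer the charset from an HTTP Content-Type header,
// fall back to the XML prolog, then to UTF-8.
function pick_charset(?string $contentType, string $xml): string
{
    if ($contentType !== null
        && preg_match('/charset\s*=\s*["\']?([\w.:-]+)/i', $contentType, $m)) {
        return strtoupper($m[1]);
    }
    if (preg_match('/<\?xml.*encoding=[\'"](.*?)[\'"].*\?>/m', $xml, $m)) {
        return strtoupper($m[1]);
    }
    return "UTF-8";
}

echo pick_charset('text/xml; charset=big5', '<?xml version="1.0" encoding="utf-8"?><r/>'), "\n";
echo pick_charset('application/rss+xml', '<?xml version="1.0" encoding="gb2312"?><r/>'), "\n";
```

In the first call the header wins over the prolog; in the second the header carries no charset, so the prolog decides.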

Mark Wu

Hi Steve:

Thanks, it works. I asked my ISP to install iconv… and it works now!!

And I am just using the hacked magpie-rss that shipped with FoF. With it, pLog can support the encodings and convert GB & BIG5 to UTF-8 without changing any code… really, thanks!!

Please visit my site to see the result. You really helped me a lot!! ^-^

RSS Feeds by Site ==> http://blog.markplace.net/index.php?blogId=1&op=Template&show=schedulebysite

RSS Feeds by Time ==> http://blog.markplace.net/index.php?blogId=1&op=Template&show=schedule

And may I submit my code, together with your hacked magpie-rss version, to pLog??

Thanks!

Regards, Mark

steve

Mark: Great news! The code is GPL, so you can share it with anybody you like, as long as they follow the terms of the GPL as well. The author of Magpie is looking at these changes now, and refining them, and they may be included in a future “official” version of MagpieRSS.

Anonymous

god. this is a boring blog.
possibly, because i’m stupid and know nothing of php encoding.
but for all our sakes, go to a party and get hammered.

David

Thomas makes a good point, but it’s even worse than that: not only are there character encodings specified by both HTTP and the XML prolog, but either or both of them could be completely wrong. I would imagine there are many feeds served as ISO-8859-1 that are actually windows-1252 (or whatever it is) and they just don’t happen to contain any of the characters that are different between the two formats, yet. When one of them does, your code might handle it, or maybe it’ll start spitting out gibberish again. And if someone gets UTF-8 and UTF-16 mixed up, I think you could wind up with everything shifted off by a byte, and now the feed is completely unreadable again.

Basically, until we can convince everyone to use UTF-8/16/32 for everything, this will be a bloody pain in the arse. It looks like PHP just makes it even more painful. I would second Phil’s recommendation: use Mark’s Universal Feed Parser, and when something breaks, just get him to fix it. :-D

steve

At this point I’m still in the “get it right when the feed is right” stage. FoF still doesn’t even do that, all the time. Once I’ve got that one licked, I may move on to “get it right even when the feed doesn’t”.

So Much Geek, So Little Time » Reject Incorrectly encoded Pingbacks

[…] ject Incorrectly encoded Pingbacks Filed under: Life - unteins @ 12:11 pm This article has some info about […]

Mark Wu

Hi Steve:

I wrote a blog about pLog RSSFetcher Plug-in. Thanks for such good work.

http://blog.markplace.net/index.php?op=ViewArticle&articleId=119&blogId=1

Regards, Mark

Harry Fuecks

Great post. Many thanks. Seems many php developers are largely oblivious to character encoding issues. Looking at some of the feed generation libraries out there, seems it’s a similar story - last time I looked only RssWriter pays any attention to UTF8. The author is Portuguese I believe, which may be why…

Gives me some ideas for further features for HTMLSax - right now it shouldn’t (not that I’ve tested carefully) choke on anything but also doesn’t support the user by taking care of encoding issues.

inertia

hi Steve,

I heard about this good tool from isis, and installed it on shared hosting to test. One thing I can’t figure out: after reading all the docs carefully and confirming with my ISP that mbstring and iconv are both compiled in, I still can’t fetch big5 blogs, e.g. isis’s blog. Where do you think I may have gone wrong?

I know this problem is quite “ambiguous”: say, the ISP didn’t compile it well, or some installation step went wrong. But I still hope to get some ideas from you.

regards
inertia

Peter Van Dijck's Guide to Ease

Steve Minutillo :: messy-78  PHP, XML, and Character Encodings: a tale of sadness, rage, and (data-)loss: a must read if you’re wrestling with PHP, XML and character encodings!…

MeriBlog : Meri Williams' Weblog

Multitasking Mice Fertilitiy
Interesting article over at OK/Cancel all about multitasking PHP, XML, and Character Encodings: a tale of sadness, rage, and (data-)loss - worth reading just for the title ;-) Very comprehensive guidebook to developing with web standards I love the…

Grace and peace to you! » 2004 » June

[…] Grace and peace to you!

» PHP, XML, and Character Encodings: a […]

Pete Prodoehl

I too must suggest that using Mark’s Universal Feed Parser would be a good idea. In fact, since the process of harvesting the feeds, parsing them, and storing them in MySQL can be separated out from the whole UI/reading part of things, this is a great suggestion. I know it sort of makes fof a weird combo PHP/Python app, but it could be an option for those of us who don’t have a problem with that.

LinuxBrit

PHP, XML, and Character Encodings
Man, PHP can be really unbelievably stupid sometimes :(

Elaine

yipes…I started taking a look at the problem when I was mucking about with a personal variant of FoF and could never quite figure out what was going on. I always thought it was a problem with Magpie, but I had no idea it went that deep. thanks for all your work on FoF, and for sharing all the gory details!

Simon Jessey

Multibyte string functions are not part of the PHP default install, so many webhosts do not include them - my own webhost refused to add them. Just thought I’d let you know.

Dominic Mitchell

What about UTF-16? Your regex won’t be able to pick up the XML declaration then…

I love spanners. :-)

-Dom

steve

inertia: I’m lucky enough to be on a host where mbstring and iconv are both included and work perfectly. I’ve actually never even compiled PHP myself, so I don’t really know what can go wrong there, other than the obvious “check phpinfo()”

Simon: I know. There’s nothing I can really do about that. In the next version of FoF the installer will inspect your system and tell you what it finds, so at least this will be less confusing.

Dominic: I knew there’d be a hitch somewhere. Are you sure? Have you tried it with UTF-16? If it doesn’t work, is there any workaround?

David

Thanks, excellent post. Good to know someone’s working so we don’t have to :)

snapping links » lookin’ good

[…] n into the problem with character encoding (darn curly quotes)…so I went back to look at what Steve Minutillo had to say abou […]

Andrew

Thanks for your post. It’s unfortunate that php’s support for i18n is so poor here. For what it’s worth (probably not much), EBCDIC is not spelled with an extra ‘I’, even though people pronounce it as if it did. Man, I hope nobody sends out news feeds in EBCDIC O:-).

车东Blog^2

Lilina: building a personal portal with an RSS aggregator (Write once, publish anywhere)
While collecting RSS parsing tools recently, I found MagPieRSS and Lilina, which is built on top of it. Lilina’s main features:

1. Web-based RSS management: add, delete, OPML export, a background RSS cache (to avoid putting too much load on the source servers), scriptlet…

车东BLOG

Analysis of parsing UTF-8 and GBK RSS in MagPieRSS: character-oriented programming in PHP explained
My first attempt at MagpieRSS failed because iconv and mbstring were not installed. Today I installed iconv and mbstring support on the server and took a careful look at how rss_fetch is used in lilina: the most important thing is to specify the RSS output format as ’MAGPIE_O…

grace

how to do it even without using XML ? I just want my page to be able to key in chinese characters, submit them to mysql in code form, then display the chinese characters on the page.

What is the testing script which I can test for this?
Thank you

steve

grace: I don’t have any links to tutorials on that, but what you’re talking about is fairly easy to do. In fact, the free weblog engine that powers this site, Wordpress, is capable of doing just that. You could download Wordpress and examine how international characters are handled.

Valery's Mindlog

…

Sascha Carlins Linkdump

This page has been linkdumped
Charsets…

Andy

I’m developing a library called mbstring emulator which emulates the mbstring functions. I have already published mbstring emulator for Japanese (supports Shift_JIS, EUC-JP, UTF-8), and now I’m working on a western-language version (iso-8859-1 and utf-8). I’d like to know which encodings you need.

steve

BIG-5 would be nice.

ryh

Good job, I’m using this on my site.

+CMS

Andy your mbstring emulator is perfect. thx

Charl47

Hi steve
I’m using magpierss 7.0. Where exactly should I put your code? I don’t understand all of your explanation; I’m a newbie in RSS and I’m French, so it’s difficult for me to translate this.
Thanks.

steve

If you are using MagpieRSS 0.7, then this code is already built in! This was written before MagpieRSS 0.7 was created.

nwestwood

I’m using magpieRSS to read multiple feeds and create 1 feed with news of interest to our industry and it works, mostly. When I write out the data I retrieve from magpie, I get “not well-formed” XML, for example & symbols in the URLs that Feed Validator doesn’t like. Is there a way to get the data back and have it encoded correctly? or what do you suggest?

-thanks - Neal

steve

Magpie is working the way Feed on Feeds uses it. You could try asking on the magpierss mailing list with some more specifics on your problem.

junesnow17

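I don’t know whether the Chinese characters I type here will display.
My English is very poor, so I can only write in Chinese.

For various reasons, while learning PHP I found that the PHP version I had installed was too old, so I spent two days upgrading to the latest version.

As a result, all the member names in the forum turned into gibberish @@

After reading this article, although the problem was not solved and my husband eventually restored the database to the previous version, I was very moved that someone is paying attention to this issue.

Because of our writing system, those of us who use Chinese characters are overlooked, and when writing programs we often end up going in circles because of text limitations. I hope that one day soon there will be a PHP that our people can use too.

Don’t make us feel constrained at every turn any more.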

Jari

Hello Steve
I was looking for this script for a long time. It looks great. But I get a parse error, unexpected ‘*’, in the line $rx=
I made a copy/paste in Dreamweaver. What’s wrong with the syntax? Thanks for helping me. Best regards

jari

Hello

I tried the magpie rss_parse.inc script that uses your post. Copying with Notepad, I now have no syntax error. I want to read xml files in ISO or UTF-8. I tried the function php4_create_parser to test the encoding of the input files. I use the regular expression ‘//m’ with preg_match, but if I echo this regular expression I get nothing. What’s wrong? How can I get the encoding of the xml files? Could you help me?

Best regards

Javier

Thanks so much man, this has solved a huge problem I had. I’ve been searching for a solution for a while; parsing XML with PHP 4.x can be frustrating, especially if your XML files need to use entities for different languages (in my case, Spanish).

I’ve learned a lot with this, thanks again ^_^

Asha

Hello,
I am using PHP and MySQL to store Japanese data. I am facing an encoding problem: I am not able to store it in the table with the proper encoding. Can anyone help me?

K

Too late probably, but just replace all “” from the string before executing the regexp, and it should work with UTF-16

K

ok, my last comment was “mangled”

Replace all 0-bytes from the string. That’s \0 by the way
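K’s suggestion can be sketched roughly as follows. The function name detect_encoding_any_width is hypothetical, and stripping NUL bytes is only a detection trick for reading the ASCII prolog: the document itself would still need a real UTF-16 to UTF-8 conversion afterwards.

```php
<?php
// Hypothetical helper: find the declared encoding even in UTF-16 input by
// dropping NUL bytes (and any byte-order mark) before running the regex.
function detect_encoding_any_width(string $xml): string
{
    $head = substr($xml, 0, 512);          // the prolog sits at the start
    $head = str_replace("\0", '', $head);  // UTF-16 ASCII becomes plain ASCII
    $head = preg_replace('/^(\xEF\xBB\xBF|\xFF\xFE|\xFE\xFF)/', '', $head);
    if (preg_match('/<\?xml.*encoding=[\'"](.*?)[\'"].*\?>/m', $head, $m)) {
        return strtoupper($m[1]);
    }
    return "UTF-8";
}

// A made-up prolog spelled out as UTF-16LE with a BOM, built by hand so the
// example does not depend on mbstring or iconv.
$ascii = '<?xml version="1.0" encoding="utf-16"?><r/>';
$utf16 = "\xFF\xFE" . implode("\0", str_split($ascii)) . "\0";
echo detect_encoding_any_width($utf16), "\n";
```

Only the first 512 bytes are examined, on the assumption that the declaration sits at the very start of the document.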

oink

i suggest to go a little further with the regular expression in a network world full of screwed up source codes:

preg_match( “/]+encoding\s*=\s*[’”]?([\w._]+(-[\w._]+)*)[’”]?[^>]\?>/i”, $xml, $m )

it’s possible i screwed up here myself, double check advised, i didn’t test it yet.

oink

well seems like your comment page doesn’t translate “smaller than”.

steve

Sorry about that… I think Wordpress probably has a similar screwed up regex that tries to sanitize comments that munged yours.

pepe

hola egañádos

Andrew

Hi, I am using MagpieRSS 0.7 but I still am getting character substitution! please see this site still in progress:
www.andrewzahn.com/crissa

can you tell what could cause this?
thanks

Jason Judge

Just a note on this bit:

“If PHP can handle it natively, fine. If not, lop off the XML prolog, replace it with one that says encoding="utf-8" and pass the whole XML file through mb_convert_encoding to convert it to UTF-8 before the parser even sees it.”

It should be noted that any UTF-8 stream can be treated as a valid ISO-8859 stream, since an ISO stream is a series of bytes. However, the reverse does not hold true. There are ISO-8859 streams (and other single-byte and multibyte streams) that turn out to be invalid as UTF-8 streams.

The reason is that a series of independent bytes is a series of bytes, but UTF-8 has strict rules about which ranges of byte values can follow certain other bytes.

The parser may not fall over now when it hits these invalid sequences, but I am not sure it is safe to assume that will always be the case.

I think it would be safer to send an unknown or unhandlable encoding into the parser as ISO8859, and then convert the entities afterwards. *That* is why the parser defaults to ISO and not UTF.
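The asymmetry described here is easy to demonstrate. The sketch below uses a hypothetical helper, is_valid_utf8, built on the fact that preg_match() with the /u modifier refuses subjects that are not well-formed UTF-8.

```php
<?php
// Hypothetical helper: true when $bytes is a well-formed UTF-8 sequence.
// preg_match() with the /u modifier fails on malformed UTF-8 subjects.
function is_valid_utf8(string $bytes): bool
{
    return preg_match('//u', $bytes) === 1;
}

var_dump(is_valid_utf8("caf\xC3\xA9")); // the word encoded as UTF-8
var_dump(is_valid_utf8("caf\xE9"));     // the same word in ISO-8859-1
```

A check like this could run before handing unconverted bytes to a parser created with "UTF-8", so that a failed conversion is caught rather than silently mangled.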

matt

Steve,

Great stuff here. I’m using Magpie v 7a. The parser looks to have incorporated your encoding fix for PHP4. On my page, particularly in the Yahoo News feeds, several of the characters are converted to question marks. I looked at the original feed, and the special characters are an apostrophe and an mdash. Apparently, all apostrophes aren’t equal as some are handled well and others are converted to ‘?’. The apostrophe causing problems slants backwards (almost like an accent).

All in all, I would say that the output is still pretty good, and completely readable. It would be icing on the cake to fix this problem with special characters.

The Yahoo feed is UTF-8. Here is the link to the actual XML file:

link

Off topic, comparing Yahoo News and Google News feeds. Yahoo is much better in terms of their advanced search options. Google has no provision to exclude based on keywords. However, Google incorporates nice thumbnails into their output descriptions. This makes them very appealing, and it will be interesting to see how long it lasts before others begin incorporating small thumbnails as well. I could see including a conditional that would only display stories which contained thumbnails.

unclepiak ลุงเปี๊ยก

sounds interesting! ขออนุญาตทดสอบภาษาไทย ("May I test the Thai language")

Rasmus

I had the same problem using magpie .72 and found that the only solution that worked, was to ditch my UTF-8 feed and replace it with an ISO-8859 feed instead. After changing that setting in WordPress and clearing magpie’s cache, everything worked perfectly.

Oliver

Hey all.. Thanks for the post. I am actually trying to mod this plugin right now for my site for something having to do nothing with favicons and would love some help if anyone is willing to trade a few emails.

Matt

If you, like me, found no luck here in resolving your char. munging issue in Magpie 0.72 and RSS (Atom works fine, right?) then perhaps what I did may help you…

when you include and define your Magpie params be sure to include this line:

define('MAGPIE_OUTPUT_ENCODING', 'UTF-8');

and in the head of your HTML document be sure (for browser compatibility and user preference consideration) to add/change the following:

(less than sign) meta http-equiv="Content-Type" content="text/html; charset=UTF-8" / (greater than sign)

of course you will replace the signs indicated in parentheses… without the parentheses… hehehe
