[gentoo-user] Any 'sed' geniuses out there? - gentoo-user

From:	gentuxx <gentuxx@×××××.com>
To:	gentoo-user@l.g.o
Subject:	[gentoo-user] Any 'sed' geniuses out there?
Date:	Tue, 27 Sep 2005 03:51:34
Message-Id:	`4338C064.3090207@gmail.com`

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

I'm writing a sed script that will parse the *broken* output of
man2html. I say broken, because the output isn't W3C compliant (HTML
or XHTML). I'd like to be able to modify it so that the final outcome
is XHTML 1.0 compliant. I'm running into a problem where the output
doesn't close the <p>, <dt>, or <dd> tags. XHTML requires that every
element be closed. So the problem I'm having is being able to
take note of the starting tag, grab the subsequent paragraph, then
insert the closing tag. What I've got /sort of/ works, but not quite.
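
For illustration, the approach described above (note the opening tag, collect the paragraph, then append the closing tag) could be sketched with sed's hold space roughly as follows. This is only a sketch under assumptions that are not in the post: each <p> sits alone on its own line, the input is just the body fragment, and the final line is not itself a <p>. The function name close_p is made up for the example, and <dt>/<dd> are not handled.

```shell
#!/bin/sh
# close_p is a made-up helper name; it reads HTML on stdin and writes to stdout.
close_p() {
    sed -n '
    # A line that is exactly <p> starts a new paragraph: swap the finished
    # block out of the hold space, close it, print it, then start the next cycle.
    /^<p>$/ {
        x
        s/^\n//
        /^<p>/ s/$/<\/p>/
        /./ p
        d
    }
    # Collect every non-blank line of the current block in the hold space.
    /^$/!H
    # At end of input, flush the block that is still being held.
    $ {
        x
        s/^\n//
        /^<p>/ s/$/<\/p>/
        /./ p
    }
    '
}

# Small made-up input with two unclosed paragraphs.
printf '<p>\n\nfirst paragraph,\nsecond line.\n\n<p>\n\nnext paragraph.\n' | close_p
```

Because the closing tag is only appended when a block is flushed (at the next <p> or at end of input), each <p> is printed once, as the first line of its own block. Blank lines inside a paragraph are dropped, which may or may not be wanted.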

Here's a sample that has been parsed, but not with the <p> modifying
elements:

<p>

Regular expression support is provided by the PCRE library package,
which is open source software, written by Philip Hazel, and copyright
by the University of Cambridge, England.  See <a
href="http://www.pcre.org/">http://www.pcre.org/</a> .

<p>

Nmap can optionally link to the OpenSSL cryptography toolkit, which is
available from <a
href="http://www.openssl.org/">http://www.openssl.org/</a> .

Here's the entire sedscr (sans comments):

/^$/{
        N
        /^\n$/d
}
/^Content-type: text\/html/c\
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
s%<\(HTML\|P\|HEAD\|TITLE\|BODY\|STRONG\|EM\|H[123456]\|D[DLT]\|T[TDRH]\)>%\L<\1>%g
s%<\/\(HTML\|P\|A\|HEAD\|TITLE\|BODY\|STRONG\|EM\|H[123456]\|D[DLT]\|T[TDRH]\)>%\L</\1>%g
s%<BR>%<br />%g
s%<HR>%<hr />%g
s%<[Dd][Ll] [Cc][Oo][Mm][Pp][Aa][Cc][Tt]>%<dl compact="compact">%
s%<A HREF\(.*\)>%<a href\1>%g
s%<A NAME\(.*\)>%<a name\1>%g
/^<[IB]>.*$/{
        N
        s%\(<[IB]>\)\(.*\)\(<\/[IB]>\)\n%\L\1\2\L\3%
}
/^<[ib]>.*$/{
        N
        s%\n%%
}
s%<[IB]>%\L&%
s%<\/[IB]>%\L&%
/<body>/,/<\/body>/{
        /<p>/!{
                H
                d
        }
        /<p>/{
                x
                s/$/<\/p>/
                G
        }
}
/^<p>$/,/<\p>$/{
        N
        /^\n<p>$/d
}

Here's the funkiness after parsing with the last part
(/<body>/,/<\/body>/{) enabled:

<p>

<p>

Regular expression support is provided by the PCRE library package,
which is open source software, written by Philip Hazel, and copyright
by the University of Cambridge, England.  See <a
href="http://www.pcre.org/">http://www.pcre.org/</a> .</p>

<p>

<p>

Nmap can optionally link to the OpenSSL cryptography toolkit, which is
available from <a
href="http://www.openssl.org/">http://www.openssl.org/</a> .</p>
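
A reduced reproduction may help show where the doubled <p> comes from. Assuming GNU sed, and taking only the two inner blocks of the last part of the script (without the <body> range), the G command appends the <p> that x has just stored in the hold space, and that same <p> then heads the next block collected by H. The function name repro and the two-paragraph input are invented for the example.

```shell
#!/bin/sh
# repro: hypothetical name; runs only the H/d and x/s/G blocks from the script.
repro() {
    sed -e '/<p>/!{H;d;}' -e '/<p>/{x;s/$/<\/p>/;G;}'
}

# Two unclosed paragraphs, one line of text each.
printf '<p>\ntext one\n<p>\ntext two\n' | repro
# prints:
#   </p>
#   <p>
#   <p>
#   text one</p>
#   <p>
```

In this reduced case the first <p> is emitted by G and again as the head of the held block, and "text two" is never printed because nothing flushes the hold space at end of input.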

(Just in case you were wondering, this IS from the nmap man page. ;-)
Thanks.

- --
gentux
echo "hfouvyAdpy/ofu" | perl -pe 's/(.)/chr(ord($1)-1)/ge'

gentux's gpg fingerprint ==> 34CE 2E97 40C7 EF6E EC40  9795 2D81 924A
6996 0993
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.1 (GNU/Linux)

iD8DBQFDOMBkLYGSSmmWCZMRAnnrAJwKNqr+/OgBdDD8X8PXX6rpKUfaxQCfU9PW
Bs2oA/76RYFbbc7DWEpfTM8=
=gcc/
-----END PGP SIGNATURE-----

--
gentoo-user@g.o mailing list
