The database for this English-Hindi dictionary is taken from the Anusaaraka
project, English-Hindi dictionary V2.0, see
http://www.iiit.net/ltrc/Dictionaries/Dict_Downld.html and is in GPL as it was.
It has more than 25 600 headwords and low quality.

As that text file was made by school children and non-IT professionals, it was
in a very bad shape.  The file looked OK in an editor, but not to a parser,
because there were spaces where they shouldn't be and many other format
exceptions. So only after some manual editing of that file, the parser has
accepted it.  Those differences are in the .diff file.

The file in this package has the diff already applied.

ehconv.yy is a parser. It is compiled with flex and will convert the text
database file, which is encoded in ISCII into a TEI P4 XML file, encoded in
UTF-8, see http://www.tei-c.org/ and http://unicode.org/.  For more information
about ISCII, see http://tdil.mit.gov.in/standards.htm and
http://www.cwi.nl/~dik/english/codes/indic.html (beware, the tables use octal
codes!).

The final step of converting the .tei file into .dict format can be done using
xmltei2xmldict.pl.

Now take your Unicode compatible dict server and client and enjoy the nice
Devanagari!  (When you use a web interface, Internet Explorer >5 can be used.
Otherwise gnome-dictionary can render Devanagari.)

As the Anusaaraka project aimed at machine aided translation between many
Indian languages, there are several other dictionaries available at
http://www.iiit.net/ltrc/Dictionaries/Dict_Frame.html.  BTW, the dict here is
_not_ the Shabdanjali English-Hindi dictionary! Also, for each dictionary there
is further information at the IIIT server.

With a 'make reverse' you can generate a reverse index for querying the
dictionary in Hindi. When converting TEI into dictd database format, the
converter might complain about "whitespace only headwords". These are cases
where the Hindi translation of a word is not included in the dictionary. In the
XML they look like "<tr>?</tr>". The question mark is removed because it is
punctuation. Empty headwords are left out of the index.

The Part-Of-Speech (<pos>) element contents don't follow the typology outlined
in the FreeDict HOWTO. Try `make pos-statistics' to see what is used.

There is much scope for future work :)

Michael

If your dict client/browser don't support Devanagari rendering,
you could translate everything into roman encoding:
uniconv -encode IS_DV -in eng-hin.tei  | uniconv -decode IS_RM >eng-hin.roman.tei
(uniconv is part of the great yudit package). But there is
one thing not working: uniconv cannot count syllables and so
creates an 'a' too much sometimes in the transliteration :(
(also, ISCII code e5 becomes "o with a bar on top"
as well as   code e1 becomes "e ..." -> We substitute simple "o" and "e" for them)

=================================================

The TEI P4 version of this database was created by 
FreeDict tools on the basis of the accompanying 
non-XML source file(s). Next, that TEI P4 version 
was converted to TEI P5 using revision 978 of the 
tools/xsl/freedict_P4toP5.xsl script.

In some cases I checked whether a newer version 
of the database existed but I wasn't lucky. 
Still, any maintenance of this dictionary should 
begin by checking on the existence of an updated version.

Piotr Bański, 3 November 2010