This is a library for converting Unicode strings to numbers. Standard
functions like strtoul and strtod do this for numbers written in the usual
Western number system using the Indo-Arabic numerals, but they do not handle
other number systems. The main functions take as input a UTF-32 Unicode
string and compute the corresponding unsigned integer. Internal computation
is done using arbitrary precision arithmetic, so there is no limit on the size
of the integer that can be converted.

Although there are a variety of additional features, the basic use of the
library is very simple, perhaps simpler than using strtoul(3). Here is
the minimal code needed to convert a UTF-32 string to an unsigned integer.
We assume that str is a wchar_t * containing a null-terminated UTF-32
string. You will also need the appropriate includes, which are discussed
in the more elaborate example below.

union ns_rval val;
unsigned long myint;

StringToInt(&val,str,NS_RETURN_ULONG,NS_ANY);
if(0 == uninum_err) myint = val.u;

This call to StringToInt attempts to convert the string str
and if successful places the result in val.u. It sets the
flag uninum_err to a non-zero value if an error occurs.
The argument NS_ANY tells StringToInt to attempt to determine
the number system itself. If it is unable to do so, uninum_err
will be set to NS_UNKNOWN_ERR.

The value of the string is returned in one of three forms.
One option is a string of ASCII characters containing the decimal
representation of the integer using the Indo-Arabic digits. This option has
the virtue of avoiding any possibility of overflow or truncation. The second
is to obtain the value as an unsigned long integer. If you are going
to do internal calculations, this is probably the most convenient option,
but some numbers (in fact, infinitely many) will not fit into an unsigned
long integer. The library guarantees that no overflow or truncation will occur;
if the number will not fit, it sets an error flag and returns 0.
The third option is to obtain the value as a GNU MP object of type mpz_t.
This is useful if you are going to do further arbitrary precision calculation.
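
The "no overflow or truncation" behaviour of the unsigned long form can be
pictured as a range check before each multiply-and-add step. The following is
only a sketch of that idea, not the library's code. The function name
digits_to_ulong and its error convention are hypothetical.

```c
#include <limits.h>
#include <stddef.h>

/* Hypothetical helper, not part of the library.
   Accumulates n decimal digit values (each 0-9, most significant first)
   into an unsigned long. If the value does not fit, sets *err to 1 and
   returns 0; otherwise sets *err to 0 and returns the value. */
unsigned long
digits_to_ulong(const int *digits, size_t n, int *err) {
  unsigned long value = 0;
  size_t i;

  *err = 0;
  for (i = 0; i < n; i++) {
    unsigned long d = (unsigned long) digits[i];

    /* value*10 + d would exceed ULONG_MAX: report it instead of wrapping */
    if (value > (ULONG_MAX - d) / 10) {
      *err = 1;
      return 0;
    }
    value = value * 10 + d;
  }
  return value;
}
```

A caller checks err before using the result, in the same way that uninum_err
is checked after StringToInt.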

The library assumes that the input is in UTF-32 Unicode, with two exceptions.
The writing systems for Klingon and Tengwar are not formally recognized by
the Unicode Consortium; for these we assume the encodings registered with the
Conscript Registry. The encodings for Egyptian hieroglyphics and Sinhala are
the proposed Unicode encodings, which are not yet (as of version 5.0) official.

The basic interface to the library is the function StringToInt.

void StringToInt (union ns_rval *ReturnValue, wchar_t *s, short ReturnType, int NumberSystem);

The first argument is a pointer to a union of a string, an unsigned long,
and an mpz_t:

union ns_rval {
  char *s;
  unsigned long u;
  mpz_t m;
};

This is used to store the "return" value.

The second argument is the UTF-32 string that you wish to convert. The third argument
indicates whether the return value should be a string, an unsigned long integer,
or an object of type mpz_t. The fourth argument specifies the number system
expected, e.g. NS_CHINESE. The constants specifying number systems are
defined in nsdefs.h.

If a string is returned, it is your job to free it.
If an object of type mpz_t is returned, when you are done with it
it is your job to remove it by calling mpz_clear.

StringToInt is an interface to a set of functions that each handle a single
writing system, e.g. ArabicToInt, DevanagariToInt, etc. These functions have the
same calling conventions except for the fact that they take no number
system argument.

The function WesternToInt assumes that the base is 10. The function
WesternGeneralToInt takes an additional argument specifying a base in the range [2,36].
It expects strings without base specifiers such as "0x" for hex. It overlaps
in function with strtoul(3). Its most likely use is in cases in which you want
to deal with numbers too large to fit into an unsigned long integer.
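
For comparison, this is how the overflow case looks with strtoul(3) alone:
the value is clamped to ULONG_MAX and errno is set to ERANGE. The wrapper
below is a sketch using only the standard C library. The name strtoul_fits is
hypothetical and is not part of this library.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical wrapper around strtoul, not part of the library.
   Returns 1 and stores the value in *out if the whole of s is a number
   in the given base that fits into an unsigned long; returns 0 otherwise. */
int
strtoul_fits(const char *s, int base, unsigned long *out) {
  char *end;

  errno = 0;
  *out = strtoul(s, &end, base);
  if (errno == ERANGE) return 0;          /* too large: clamped to ULONG_MAX */
  if (end == s || *end != '\0') return 0; /* no digits, or trailing characters */
  return 1;
}
```

When this returns 0 because of ERANGE, the number is one that would have to
be requested as a string or as an mpz_t instead.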

The auxiliary function

int StringToNumberSystem (char *);

returns a number system specifier corresponding to a name such as "Chinese", or
NS_UNKNOWN if it does not recognize the name. The inverse function

char *NumberSystemToString (int);

is also provided.

The function:

char *ListNumberSystems(int);

is a generator that enumerates the number systems known to the library. Each time it
is called with a non-zero argument it returns another number system name. Calling it
with an argument of zero resets it to the beginning of the list.

For example, the following line will print the list of supported number
systems on the standard output:

    while (ds = ListNumberSystems(1)) printf("%s\n",ds);
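
A resettable generator of this kind can be written with a static position
counter. The sketch below shows the pattern only. It is not the library's
implementation, the function name next_sample_name is hypothetical, the three
names in the list are merely illustrative, and returning NULL on reset is an
assumption.

```c
#include <stddef.h>

/* Illustrative list only; NULL marks the end. */
static const char *sample_names[] = { "Chinese", "Gurmukhi", "Lao", NULL };

/* Hypothetical stand-in for a resettable generator.
   Non-zero argument: return the next name, or NULL when the list is exhausted.
   Zero argument: rewind to the start of the list and return NULL. */
const char *
next_sample_name(int advance) {
  static size_t pos = 0;   /* persists between calls */

  if (!advance) {
    pos = 0;
    return NULL;
  }
  if (sample_names[pos] == NULL) return NULL;
  return sample_names[pos++];
}
```

Because the position is kept in a static variable, a generator written this
way is not reentrant: two loops that interleave their calls share one position.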


In almost all cases, it is possible to determine the number system from a single
string. The auxiliary function:
 
int GuessNumberSystem(wchar_t *); 

returns a number system identifier corresponding to the number system of the string
it is passed. It returns NS_UNKNOWN if it does not recognize the number system
and NS_ALLZERO if the string consists entirely of zeroes. The number system of
such a string cannot be determined unambiguously since several number systems
previously lacking a zero have added one recently and sometimes use the same
glyph and codepoint. However, it is desirable to distinguish this case from
NS_UNKNOWN for two reasons. First, the value of such a string is determinate,
namely 0. Second, if you know that all of the data you are dealing with is in the
same number system, it is sensible to adopt different strategies in dealing
with the two cases. If the first item returns NS_UNKNOWN, you might as well
abandon processing as you are not going to be able to deal with it. If the first
item returns NS_ALLZERO, you can expect to determine the number system
from subsequent items, most of which will most likely not consist entirely
of zeroes.
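
The strategy just described can be sketched as a small loop over the results
of guessing each item in turn. This is an illustration only: the constants
SKETCH_UNKNOWN and SKETCH_ALLZERO are placeholders standing in for NS_UNKNOWN
and NS_ALLZERO (whose real values are defined in nsdefs.h), and the function
name resolve_number_system is hypothetical.

```c
#include <stddef.h>

/* Placeholder values for this sketch only, not the library's constants. */
enum { SKETCH_UNKNOWN = -1, SKETCH_ALLZERO = -2 };

/* Hypothetical helper. Given the guess for each input item, in order,
   decide on one number system for the whole data set:
   - an unrecognized item means we give up at once;
   - an all-zero item is skipped, since its value is 0 in any system;
   - the first definite guess is taken as the answer.
   Returns SKETCH_ALLZERO if every item was all zeroes. */
int
resolve_number_system(const int *guesses, size_t n) {
  size_t i;

  for (i = 0; i < n; i++) {
    if (guesses[i] == SKETCH_UNKNOWN) return SKETCH_UNKNOWN;
    if (guesses[i] != SKETCH_ALLZERO) return guesses[i];
  }
  return SKETCH_ALLZERO;
}
```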

The variable:

int uninum_err;

is used to report errors. It is set to zero at the beginning of every
call so you need not do it yourself. A non-zero value indicates an error.
The errors defined are:

NS_BADCHARACTER
	indicates that the string contains a character that it should not.
	The first character that was not recognized is placed in the
	variable uninum_badchar.

NS_DOESNOTFIT
	indicates that the number represented by the string does not fit into an
	unsigned long integer.

NS_UNKNOWN_ERR
	indicates that the writing system is not recognized.

NS_BADBASE
	WesternGeneralToInt has been called with a base outside the
	valid range of [2,36].

NS_NOTCONSISTENTWITHBASE
	WesternGeneralToInt has been applied to a string that
	contains a character not possible in the specified base.
	(For example, if the specified base is 8, neither 8 nor 9 nor
	any of the letters can validly appear in the string.)

Two other ancillary functions are provided. 

wchar_t *NormalizeChineseNumbers (wchar_t *s);

Replaces simplified and variant Chinese numerals with their standard, traditional
counterparts, which are the only ones understood by ChineseToInt. This function
may reallocate storage since some such replacements increase the number of
characters in the string. It is called automatically when ChineseToInt is
called via StringToInt.

wchar_t *StripSeparators (wchar_t *s, wchar_t separator);

Returns a string from which the "thousands" separator specified in its
second argument has been stripped. Since most non-Western writing systems
rarely or never use such separators, it is not called automatically,
but you may find it useful.
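
To make the operation concrete, here is a sketch of separator stripping
written with the standard wide-character functions. It is not the library's
StripSeparators. The name strip_sep_copy is hypothetical, and returning a
newly allocated copy that the caller frees is an assumption of this sketch.

```c
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper, not the library's StripSeparators.
   Returns a newly allocated copy of s with every occurrence of sep
   removed, or NULL if memory cannot be allocated. The caller frees it. */
wchar_t *
strip_sep_copy(const wchar_t *s, wchar_t sep) {
  size_t len = wcslen(s);
  wchar_t *out = malloc((len + 1) * sizeof(wchar_t));
  wchar_t *dst = out;

  if (out == NULL) return NULL;
  for (; *s != L'\0'; s++) {
    if (*s != sep) *dst++ = *s;   /* copy everything except the separator */
  }
  *dst = L'\0';
  return out;
}
```

For example, strip_sep_copy(L"1,234,567", L',') yields the string L"1234567".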

The following program illustrates the use of the library. You may also
find it useful to study the source for numconv.c, which provides a command-line
interface to the library.

------------------------------------------------------------------------------------
#include <stdlib.h>
#include <stdio.h>
#include <wchar.h>
#include <gmp.h>

#include <uninum/nsdefs.h>
#include <uninum/uninum.h>

/* Create two UTF-32 strings */
wchar_t *s1=L"\x0A67\x0A69\x0A68"; /* Gurmukhi */
wchar_t *s2=L"\x0ED5\x0ED7\x0ED6"; /* Lao */

int
main(int ac, char **av) {
  int ns;

  /* This is where the "return" value will be stored */
  union ns_rval val;

  /* So that we can check whether it has changed */
  uninum_err = 0;

  /* We already know what number system this should be */
  StringToInt(&val,		/* pointer to return receiver */
	     s1,		/* the string to convert */
	     NS_RETURN_STRING,	/* flag requesting result as an ascii string */
	     NS_GURMUKHI);	/* number system */

  /* The string is in the s member of union val; it is our job to free it */
  if(!uninum_err) {
    printf("%s\n",val.s);
    free(val.s);
  }

  /* Pretend we don't know what number system s2 is in */
  ns=GuessNumberSystem(s2);
  printf("The second number system is: %s\n",NumberSystemToString(ns));
  if(ns == NS_UNKNOWN) exit(2);

  /* So that we can check whether it has changed */
  uninum_err = 0;
  StringToInt(&val,
	     s2,
	     NS_RETURN_ULONG,	/* flag requesting result as an unsigned long int */
	     ns);		/* number system value obtained from GuessNumberSystem */

  /* Unsigned long is in u member of union val */
  if(!uninum_err) printf("%lu\n",val.u);

  exit(0);
}
 
