# Fundamental Puzzle Type Survey
**Status:** draft | partially implemented

Over the course of building Crosswords, we've identified a number of
puzzle-specific data structures that are needed by the code base. Some
of these have robust implementations, others are partially
implemented, while some are reimplemented multiple places in the code
base or just don't exist.

This document is an attempt to list all the fudamental types we may
need as well as list some of the performance tradeoffs associated with
them. The end goal is to move a number of these types into libipuz so
that it can be used universally.

As an additional goal, we have a number of libipuz APIs that
accept/return GArrays. Best practice for gobject-introspection.

# Text and Strings

First a note on text handling. All text internally is encoded as
`UTF-8`. We validate it on input, and assume it's valid everywhere in
the codebase (similar to GTK). In multiple places, we effectively
convert the string to `UCS-4` by treating it as a stream of
`gunichars`.

With crosswords, things are a little weird as we break text into
discrete cells independent of the larger context. Unfortunately,
breaking up UTF-8 a character at a time can result in nonsensical
results. It works well enough for English and other latin scripts, but
long term we need to move to a stream of discrete graphemes. We've
taken some initial steps that way with the [misc cluster
functions](https://gitlab.gnome.org/jrb/crosswords/-/blob/master/src/crosswords-misc.h?ref_type=heads#L39),
but that is a really basic implementations and is just used for paste.

Ideally we'd find a robust grapheme library and involve that for
indexing strings.

**Note:** this is all predicated on graphemes being needed for other
languages, such as Indic scripts.

## WordList

Note, the WordList currently does **not** support multi-codepoint
graphemes, though it does support multi-byte UTF-8 strings. If we ever
support a language with a word-list that requires that we will have to
reassess this.

# Fundamental Structs

We have two structs that are used everywhere:

```C
typedef struct
{
  IPuzClueDirection direction;
  guint index;
} IPuzClueId;

typedef struct
{
  guint row;
  guint column;
} IPuzCellCoord;
```

Of these two, `IPuzCellCoord` is much more embedded in the codebase,
but both are widely used.

# Custom Data Types

## Character Set

`IPuzCharset` is used in a surprising number of places. It's one of
the most versatile structures we have. Some places it's used:

* Each puzzle has a charset listing its valid characters
* This charset can be passed to PlayCell to filter out letters that
  aren't part of the puzzle
* We use it to keep a histogram of letters in use in a puzzle
   * We also keep a histogram of clue lengths by casting the length to
     a gunichar (!)
* It's used in the fast path for both the Grid and Acrostic solver. We
  add and remove characters to them every iteration of the loop.
* We calculate the potential overlapping characters in a cell to limit
  options for users.
* etc.


**Challenges:**
* Currently, we have the mutable charset builder -> charset
  pattern. That has kept lookups fast for the Acrostic solver, but
  we've ran into places where we need to keep a mutable charset
  around. As an example, in `word_list_find_intersections()`, we want
  to set a builder's value to be a multiple of another factor. There's
  no way to go through a builder and adjust its values, so we have to
  create a second one.
* We store a unicode codepoint per node. This is fine for all the
  languages we support, but may not be fine for eg. hindi.


## Cell Array

We have a couple of instances of the CellArray and they have very
different performance characteristics.

* Selection keeps a sorted array and tests for uniqueness. Each cell
  should only exist once.
* Another common use is every clue has a list of the cells that it
  refers to
* The solver has a list of cells to search through, and removes/adds
  cells as it traverses the tree and backtracks. (This usecase may
  need a specialized structure though).

This data structure needs to be sorted/unique, or ordered depending on
where it's used.

## Grid Words

Surprisingly, we have three different representations of words in the
code base.

* There are basic C strings. These come from user input or the puzzle file.
* The `WordList` also has C strings encoded in it. These strings are
  special, as the bytes before it encode priority and enumerations
* Finally, there's the `WordIndex` struct. This struct can be used to
  lookup a word in a `WordList`

We commonly want to index clusters in the string. There is a lot of
code that iterates through UTF-8 strings in the codebase that is
incorrect (see string handling above) and thus we may want to consider
a specialized word type. Consider the following (very artificial) grid
entry:

![spinal tap](spinaltap.png)

This is encoded with the following UTF-8 string and cell boundaries:

```
S  P  ı     N̈        A  L  T  A  P
53 50 C4 B1 4E CC 88 41 4C 54 41 50 00
  |  |     |        |  |  |        |
```

the *S* and *P* characters are each one codepoint / byte. The
*dotless i* (U+0131) is an additional codepoint and two bytes. The
*N-diaresis* is contrived of two codepoints (U+004E and U+0308) and
three bytes. Finally, the last three characters — *TAP* — are all in
one cell as a rebus entry.

A very useful data structure would be one that breaks a word into its
graphemes, and lets you iterate through them.

It may also be useful to be able to indicate the offset of a cell
array it exists in the grid in so we can index it. An example of a
function that could return this is
`ipuz_crossword_get_clue_string_by_id()`.

In the above example, `word[6]` would return the string `"TAP."` That
being said, outside the context of a grid the offsets are meaningless
and potentially misleading, so this would need a bit of thought to
make sure it is useful.

## Word Array

We have lists of words showing up in multiple places. There's
`WordList`, `WordArray`, `WordListModel`,and `WordArrayModel`. In
addition, multiple places just use a `GArray` of strings. Some
consolidation would be nice at some point, especially if we ever get a
good Grid Word type.

## Filters

Filters are used by the word list to describe a partially filled in
clue (eg. `"GN??E"`). Currently, filters are represented as
strings. We have multiple independent implementations of filter
validation code. There are alos a couple common operations that we run
frequently on filters, such as counting how many '?' characters are in
the filter — are multiple fastpaths triggered by totally open or
closed filters. Having a simple filter class would be nice.

## Guesses

IPuzGuesses is a simple overlay over a puzzle grid. It stores a
string-per-cell, and is used to keep the guesses that a user makes in
the game. It's currently not a GObject, though there's an outstanding
MR to make it one.

Recently, the editor started using the guesses object to store pencil
markings as a way to record scratch marks. It also is used to show
results from the autofill functionality. Looking at other crossword
editors, it may be useful to start running a background task that
starts calculating things like intersection characters or mark trouble
spots. If we start doing that, we'll need to store more per cell than
the current string. Given that, storing custom data per cell is a
useful extension.

**NOTE:** We'd have to approach this somewhat carefully. Currently the
`PuzzleStack` doesn't keep guesses that are attached to puzzles and
these would be cleared every grid/guess change. We'll have to test
assumptions, and see if keeping the Guesses around could save
computational work.
