Tokenization is the process of splitting running text (a string of characters) into processing units, called tokens, which are usually either words or subword units.
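As an illustration, a minimal word-level tokenizer can be sketched with a regular expression that separates runs of word characters from punctuation; real subword tokenizers (BPE, WordPiece) are considerably more sophisticated, and the function name here is just for illustration.

```python
import re

def word_tokenize(text):
    # \w+ matches runs of word characters; [^\w\s] matches single
    # punctuation marks, so "doesn't" -> ["doesn", "'", "t"]
    return re.findall(r"\w+|[^\w\s]", text)

print(word_tokenize("Tokenization splits text, doesn't it?"))
# -> ['Tokenization', 'splits', 'text', ',', 'doesn', "'", 't', 'it', '?']
```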
Tokenization usually has a large effect on the performance of a system. When testing methods, it is good practice to use a reasonable baseline such as BPE for tokenization, and to keep tokenization the same when comparing systems, since improvements to tokenization can outweigh possible model improvements. In addition, a mismatch between the tokenization used during training and testing can be detrimental to performance, which can be an issue when running off-the-shelf tools in downstream applications.
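The core of BPE can be sketched in a few lines: repeatedly find the most frequent adjacent pair of symbols in the corpus and merge it into a new symbol. This is a toy illustration of the merge loop only, not a full trainer (no vocabulary bookkeeping, no special handling of word boundaries).

```python
from collections import Counter

def most_frequent_pair(tokens):
    # Count adjacent symbol pairs across the token sequence
    pairs = Counter(zip(tokens, tokens[1:]))
    return max(pairs, key=pairs.get)

def merge_pair(tokens, pair):
    # Replace every occurrence of the pair with a single merged symbol
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

tokens = list("low lower lowest")
for _ in range(2):  # apply two merges: first "lo", then "low"
    tokens = merge_pair(tokens, most_frequent_pair(tokens))
print(tokens)
# -> ['low', ' ', 'low', 'e', 'r', ' ', 'low', 'e', 's', 't']
```

After enough merges, frequent words become single tokens while rare words remain split into subword pieces.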
For a fixed vocabulary size, the number of tokens per byte of text gives a measure of tokenization quality (lower is better); see Section 2.1 of the Jurassic-1 paper.
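The metric is straightforward to compute; the sketch below compares a coarse word-level split with a character-level split on the same text, where the lower value indicates the more compressive tokenization.

```python
def tokens_per_byte(tokens, text):
    # Ratio of token count to the UTF-8 byte length of the text
    return len(tokens) / len(text.encode("utf-8"))

text = "tokenization matters"
word_level = text.split()                 # 2 tokens
char_level = list(text)                   # 20 tokens
print(tokens_per_byte(word_level, text))  # -> 0.1
print(tokens_per_byte(char_level, text))  # -> 1.0
```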
For some tasks, it can be useful to filter out certain very frequent words (stopwords).
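Stopword filtering is a simple list lookup; the stopword set below is an illustrative stand-in, not a standard resource (libraries such as NLTK ship curated lists).

```python
# Illustrative stopword set; real lists are larger and language-specific
STOPWORDS = {"the", "a", "an", "is", "of", "to", "and", "on"}

def filter_stopwords(tokens):
    # Drop tokens whose lowercased form is in the stopword set
    return [t for t in tokens if t.lower() not in STOPWORDS]

print(filter_stopwords(["The", "cat", "is", "on", "a", "mat"]))
# -> ['cat', 'mat']
```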