# TF-NLP Data Processing

## Code locations

Open sourced data processing libraries:
[tensorflow_models/official/nlp/data/](https://github.com/tensorflow/models/tree/28d972a0b30b628cbb7f67a090ea564c3eda99ea/official/nlp/data)

## Preprocess data offline vs. TFDS

TF-NLP supports two ways to provide training data to the input pipeline:
1) processing and tokenizing the data offline with Python scripts, Beam, or
Flume; 2) reading the text data directly from
[TFDS](https://www.tensorflow.org/datasets/api_docs/python/tfds) and using
[TF.Text](https://www.tensorflow.org/tutorials/tensorflow_text/intro) for
tokenization and preprocessing inside the tf.data input pipeline.

### Preprocessing scripts

We have implemented data preprocessing for multiple datasets in the following
python scripts:

*   [create_pretraining_data.py](https://github.com/tensorflow/models/blob/28d972a0b30b628cbb7f67a090ea564c3eda99ea/official/nlp/data/create_pretraining_data.py)

*   [create_finetuning_data.py](https://github.com/tensorflow/models/blob/28d972a0b30b628cbb7f67a090ea564c3eda99ea/official/nlp/data/create_finetuning_data.py)

Then, pass the processed files, which contain `tf.Example` protos, to the
`input_path` argument of
[`DataConfig`](https://github.com/tensorflow/models/blob/28d972a0b30b628cbb7f67a090ea564c3eda99ea/official/core/config_definitions.py#L28).

### TFDS usage

For convenience and consolidation, we built a common
[input_reader.py](https://github.com/tensorflow/models/blob/28d972a0b30b628cbb7f67a090ea564c3eda99ea/official/core/input_reader.py)
library to standardize input reading, which has built-in support for TFDS.
Setting the `tfds_name`, `tfds_data_dir` and `tfds_split` arguments in
[`DataConfig`](https://github.com/tensorflow/models/blob/28d972a0b30b628cbb7f67a090ea564c3eda99ea/official/core/config_definitions.py#L28)
makes the tf.data pipeline read from the corresponding dataset inside TFDS.
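
The choice between the two data sources can be pictured with a small,
framework-free sketch. `SketchDataConfig` and `select_source` are hypothetical
names invented for illustration, and the file path is a placeholder; only the
field names `input_path`, `tfds_name`, `tfds_data_dir` and `tfds_split` come
from the text above. The validation rules are assumptions about sensible
behavior, not a copy of `input_reader.py`.

```python
import dataclasses


@dataclasses.dataclass
class SketchDataConfig:
  """Hypothetical stand-in holding only the fields discussed above."""
  input_path: str = ""
  tfds_name: str = ""
  tfds_data_dir: str = ""
  tfds_split: str = ""


def select_source(config: SketchDataConfig) -> str:
  """Returns "tfds" or "files" depending on which fields are set."""
  if config.input_path and config.tfds_name:
    # Assumption: the two sources are mutually exclusive.
    raise ValueError("Set only one of `input_path` and `tfds_name`.")
  if config.tfds_name:
    if not config.tfds_split:
      # Assumption: a TFDS dataset needs an explicit split.
      raise ValueError("`tfds_name` is set but `tfds_split` is empty.")
    return "tfds"
  if config.input_path:
    return "files"
  raise ValueError("Set either `input_path` or `tfds_name`.")


# Offline-processed files (placeholder path) vs. a TFDS dataset.
offline = SketchDataConfig(input_path="/tmp/train.tf_record")
online = SketchDataConfig(tfds_name="glue/mrpc", tfds_split="train")
print(select_source(offline))  # files
print(select_source(online))   # tfds
```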

## DataLoaders

To manage multiple datasets and processing functions, we defined the
[DataLoader](https://github.com/tensorflow/models/blob/28d972a0b30b628cbb7f67a090ea564c3eda99ea/official/nlp/data/data_loader.py)
class to work with the
[data loader factory](https://github.com/tensorflow/models/blob/28d972a0b30b628cbb7f67a090ea564c3eda99ea/official/nlp/data/data_loader_factory.py).

Each dataloader defines the tf.data input pipeline inside the `load` method.

```python
class DataLoader(metaclass=abc.ABCMeta):

  @abc.abstractmethod
  def load(
      self,
      input_context: Optional[tf.distribute.InputContext] = None
  ) -> tf.data.Dataset:
    """Builds and returns the tf.data input pipeline."""
```

Then, the `load` method is called inside each NLP task's `build_inputs` method,
and the trainer wraps that call to create distributed datasets.

```python
def build_inputs(self, params, input_context=None):
  """Returns tf.data.Dataset for pretraining."""
  data_loader = YourDataLoader(params)
  return data_loader.load(input_context)
```

By default, in the example above, `params` is the `train_data` or
`validation_data` field of the `task` field of the experiment config. `params`
is an instance of `DataConfig`.
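
The wiring between the experiment config and `build_inputs` can be sketched
without TensorFlow. Every class below is a hypothetical, heavily simplified
stand-in: the real `DataConfig`, task and dataloader classes have more fields,
and a real `load` returns a `tf.data.Dataset` rather than a list. The sketch
only shows which config field ends up as `params`.

```python
import dataclasses
from typing import List, Optional


@dataclasses.dataclass
class SketchDataConfig:
  """Stand-in for DataConfig with two illustrative fields."""
  input_path: str = ""
  is_training: bool = True


@dataclasses.dataclass
class SketchTaskConfig:
  """Stand-in for the `task` field of an experiment config."""
  train_data: SketchDataConfig = dataclasses.field(
      default_factory=SketchDataConfig)
  validation_data: SketchDataConfig = dataclasses.field(
      default_factory=SketchDataConfig)


class SketchDataLoader:
  """Stand-in dataloader; a real one returns a tf.data.Dataset."""

  def __init__(self, params: SketchDataConfig):
    # Copy what `load` needs so it depends only on the loader's own state.
    self._input_path = params.input_path
    self._is_training = params.is_training

  def load(self, input_context: Optional[object] = None) -> List[str]:
    mode = "train" if self._is_training else "eval"
    return [f"{mode}:{self._input_path}"]


class SketchTask:
  """Stand-in task exposing the same `build_inputs` shape as above."""

  def build_inputs(self, params, input_context=None):
    return SketchDataLoader(params).load(input_context)


task_config = SketchTaskConfig(
    train_data=SketchDataConfig(input_path="train.tf_record"),
    validation_data=SketchDataConfig(
        input_path="dev.tf_record", is_training=False))
task = SketchTask()
print(task.build_inputs(task_config.train_data))       # ['train:train.tf_record']
print(task.build_inputs(task_config.validation_data))  # ['eval:dev.tf_record']
```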

Note that, for TPU training, the entire `load` method runs on the TPU workers,
so the function must not access outside resources, e.g. the task attributes.

To work with raw text features, use a `DataLoader` that handles the text data
with TF.Text. You can take the following dataloader as a reference:

*   [sentence_prediction_dataloader.py](https://github.com/tensorflow/models/blob/28d972a0b30b628cbb7f67a090ea564c3eda99ea/official/nlp/data/sentence_prediction_dataloader.py)
    for BERT GLUE fine-tuning using TFDS with raw text features.

## Speed up training using TF.data service and dynamic sequence length on TPUs

With TF 2.x, we can enable some types of dynamic shapes on TPUs, thanks to the
TF 2.x programming model and the TPUStrategy/XLA work.

Depending on the data distribution, we see a 50% to 90% speedup on typical
text data for BERT pretraining applications relative to padded static-shape
inputs.

To enable dynamic sequence lengths, we need to use tf.data service for global
bucketizing over sequences. To turn it on, add `--enable_tf_data_service` when
you start experiments.

To pair with tf.data service, we need to use a dataloader that implements the
bucketizing function. You can take the following dataloader as a reference:

*   [pretrain_dynamic_dataloader.py](https://github.com/tensorflow/models/blob/28d972a0b30b628cbb7f67a090ea564c3eda99ea/official/nlp/data/pretrain_dynamic_dataloader.py)
    for BERT pretraining on the tokenized datasets.
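
To see why bucketizing by length helps, the following sketch counts padded
tokens for a toy set of sequences under static padding and under length
buckets. It is plain Python and does not use tf.data service or the dataloader
above; the sequence lengths, bucket boundaries and the helper names
`padded_tokens_static` and `padded_tokens_bucketed` are made up for
illustration.

```python
import bisect
from typing import List, Sequence


def padded_tokens_static(lengths: Sequence[int], max_seq_length: int) -> int:
  """Total tokens when every sequence is padded to max_seq_length."""
  return len(lengths) * max_seq_length


def padded_tokens_bucketed(lengths: Sequence[int],
                           boundaries: List[int]) -> int:
  """Total tokens when each sequence is padded to its bucket boundary.

  `boundaries` must be sorted, and its last entry must be at least as large
  as the longest sequence.
  """
  total = 0
  for length in lengths:
    # Index of the smallest boundary that is >= length.
    index = bisect.bisect_left(boundaries, length)
    total += boundaries[index]
  return total


lengths = [30, 60, 100, 120, 250, 500]
boundaries = [64, 128, 256, 512]

static = padded_tokens_static(lengths, max_seq_length=512)
bucketed = padded_tokens_bucketed(lengths, boundaries)
print(static, bucketed)  # 3072 1152
```

In this toy case the bucketed layout processes well under half of the tokens
that static padding does. How much of that turns into training speedup depends
on the real length distribution and the bucket boundaries chosen.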