[chatgpt]Reward Model Training Process update (#3133)

* add normalize function to value_head in bloom rm

* add normalization to value_function in gpt_rm

* add normalization to value_head of opt_rm

* add Anthropic/hh-rlhf dataset

* Update __init__.py

* Add LogExpLoss in RM training

* Update __init__.py

* update rm trainer to use accuracy as target

* update example/train_rm

* Update train_rm.sh

* code style

* Update README.md

* Update README.md

* add rm test to ci

* fix tokenizer

* fix typo

* change batch size to avoid OOM in CI

* Update test_ci.sh

BlueRum committed 3y ago

7548ca5a54ed117f03247dcb43ec1dd962ae04e0

Parent: 1e58d31

Committed by GitHub <noreply@github.com> on 3/20/2023, 1:59:06 AM