Deep reinforcement learning (DRL) usually refers to deep Q-learning: a neural-network approach to estimating Q-values for the states and actions of a Markov Decision Process (MDP). The basic concept is very simple, though there are many implementations with different tricks.

Deep Q-learning is very simple if you already know reinforcement learning. It estimates a Q-value function using a neural network. The inputs to the NN are a state s and an action a, and the output is the estimated Q-value of that pair, q^hat(s, a). The tricky part is how to train the NN. Ideally we would regress toward the ground truth q*(s, a), but since we don't know q*(s, a), we use a bootstrapped target derived from the Bellman equation: target = E[r + γ max_a' q^hat(s', a')], where r is the reward, γ is the reward discount, and q^hat(s', a') is computed using the current NN. That is deep Q-learning, at least the basics of it.
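The loop above can be sketched in a few lines. This is a minimal toy, not a full DQN: the "network" is a linear model over one-hot states (so the weights are effectively a table), and the chain MDP, learning rate, and episode count are all made-up for illustration. The Bellman target and the regression step are exactly the ones described above.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 4, 2, 0.9

# Toy stand-in for a deep NN: q_hat(s, a) = w[s, a], a linear model
# over one-hot state features (so the weights act like a table).
w = np.zeros((n_states, n_actions))

def td_target(reward, next_state, done):
    # Bellman target: r + gamma * max_a' q_hat(s', a');
    # no bootstrapping at terminal states.
    return reward if done else reward + gamma * np.max(w[next_state])

def update(state, action, reward, next_state, done, lr=0.1):
    # Regress q_hat(s, a) toward the target with one SGD step
    # on the squared error (target - q_hat)^2.
    target = td_target(reward, next_state, done)
    w[state, action] += lr * (target - w[state, action])

# Hypothetical chain MDP: action 1 moves right, action 0 moves left,
# reward 1 for reaching the last (terminal) state.
for _ in range(500):
    s = int(rng.integers(n_states - 1))
    a = int(rng.integers(n_actions))
    s2 = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    r = 1.0 if s2 == n_states - 1 else 0.0
    update(s, a, r, s2, done=(s2 == n_states - 1))

print(np.argmax(w[:n_states - 1], axis=1))  # greedy policy should prefer action 1
```

A real DQN replaces the linear model with a deep network trained by gradient descent, and adds tricks such as experience replay and a separate target network, but the target computation is the same.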

If you don’t understand how Q-values and V-values are computed, you should first learn the basics of reinforcement learning.