This work develops a resource allocation (RA) scheme based on deep reinforcement learning (DRL) for device-to-device (D2D) communications underlaying cellular networks. In a setting where a cellular channel can be assigned to multiple D2D links, RA aims to determine the transmission power and spectrum channel of each D2D link so as to maximize the sum of the average effective capacities of all cellular and D2D links in a cell, accumulated over many time steps. Optimizing performance over multiple time steps while allowing a cellular channel to be shared by several D2D links incurs substantial system overhead and computational complexity, making optimal RA practically infeasible, especially when many D2D links are involved. To reduce this complexity, we present a sub-optimal RA scheme based on multi-agent DRL that operates on information shared among participating devices, such as their locations and allocated resources. Each agent corresponds to one D2D link, and the agents learn in a staggered, cyclic pattern. The proposed DRL-based RA scheme allocates resources to D2D devices quickly in response to dynamically changing network configurations, including device locations. When the number of devices in a cell is large, the proposed sub-optimal RA scheme outperforms other schemes, and the performance gain becomes more pronounced as the number of devices grows.