[SOLVED] GPIO direct access


#1

I’ve tried to implement software SPI on the XIO pins and I’m a little confused about the performance. I’ve used the libsoc library to manipulate the GPIO lines. It uses sysfs, which, as I understand it, is the performance bottleneck. Are there other methods to access the GPIO directly instead of through sysfs?

For example, this code stalls my program for a long time:

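// Bit-bang one byte over software SPI, LSB first.
// LCDPort, nsleep(), SCREEN_WIDTH and SCREEN_HEIGHT are defined elsewhere (not shown).
void spi_send(char data)
{
    int i;
    for (i = 0; i < 8; ++i)
    {
        char value = (0x01 & (data >> i));
        libsoc_gpio_set_level(LCDPort.mosi, value ? HIGH : LOW); // put the bit on MOSI
        nsleep(16);
        libsoc_gpio_set_level(LCDPort.sck, HIGH);                // rising clock edge
        nsleep(16);
        libsoc_gpio_set_level(LCDPort.sck, LOW);                 // falling clock edge
        nsleep(16);
    }
}

int main()
{
    int j;
    for (j = 0; j < SCREEN_WIDTH * SCREEN_HEIGHT; j++)
    {
        spi_send(0x00);
        spi_send(0x00);
    }

    return 0;
}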

#2

A couple of points. First, I think you will find that “nsleep(16)” does not in fact sleep for anything like 16 nanoseconds on a typical Linux machine. I did some timing measurements which I have yet to publish (needs some touching-up), but it sleeps for much longer.

Second, the XIO lines are slow because they are off-CPU. Every access to them goes over a serial bus (I2C) between the CPU and the device that controls the XIO lines. You want to use the CSI lines, which are driven directly by the CPU. I am not familiar with the libsoc library, and did my testing by using the sysfs driver directly. But even through sysfs, the CSI lines gave MUCH higher performance.

I’ll try to put the finishing touches on my measurements and post them later today.
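A quick way to check the first point on your own board is to time a single nanosleep() call with clock_gettime(CLOCK_MONOTONIC). This is a minimal sketch; measure_sleep_ns is a hypothetical helper name, and the overshoot you see will depend on the kernel and its timer configuration, so no particular figure is implied here.

```c
#define _POSIX_C_SOURCE 199309L
#include <time.h>

/* Convert a timespec to nanoseconds. */
static long long ts_to_ns(const struct timespec *ts)
{
    return (long long)ts->tv_sec * 1000000000LL + ts->tv_nsec;
}

/* Hypothetical helper: request a sleep of req_ns nanoseconds
 * (must be below one second) and return how long it actually
 * took, measured on the monotonic clock. */
long long measure_sleep_ns(long req_ns)
{
    struct timespec req = {0, req_ns};
    struct timespec start, end;

    clock_gettime(CLOCK_MONOTONIC, &start);
    nanosleep(&req, NULL);
    clock_gettime(CLOCK_MONOTONIC, &end);

    return ts_to_ns(&end) - ts_to_ns(&start);
}
```

Calling measure_sleep_ns(16) a few times and printing the results gives a direct comparison with the 16 ns that nsleep(16) requests.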


#3

Thank you for your reply.
nsleep is my own wrapper around nanosleep; it looks like this:

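#include <time.h>

/* Sleep for the given number of nanoseconds (must be below 1,000,000,000). */
void nsleep(long nanosec)
{
    struct timespec tw = {0, nanosec};  /* requested duration */
    struct timespec tr;                 /* remaining time if interrupted */
    nanosleep(&tw, &tr);
}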

But I’ve also tried driving the GPIO without any sleeps, with the same results.
So the I2C communication between the XIO expander and the processor might be the reason for the low performance. I will try to use the CSI pins. As I understand from this topic, they are numbered 128 to 139?


#4

If you’re trying to bitbang SPI, I believe the SPI code in my Adafruit_GPIO port will work: https://github.com/xtacocorex/Adafruit_Python_GPIO

I’ve never done SPI, nor do I have any SPI devices, so I am unable to test.


#5

I believe that you will want to stick to using the following ports for general purpose GPIO, but I could be wrong.

CSID0 132
CSID1 133
CSID2 134
CSID3 135
CSID4 136
CSID5 137
CSID6 138
CSID7 139

From the CHIP layout diagrams, pins 128 to 131 have names other than CSI-Dx: CSI-PCLK, CSI-MCLK, CSI-HSYNC, and CSI-VSYNC.

Please post how this works out, as I’m interested in bit-banging to read the temperature from a DHT22 temperature sensor using code based on Adafruit’s code for the Raspberry Pi.

The Adafruit code uses routines called bcm2835_gpio_write(), bcm2835_gpio_fsel(), bcm2835_gpio_lev(), and bcm2835_init(). I need to convert those calls to something equivalent for CHIP.
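On CHIP, rough counterparts of bcm2835_gpio_fsel(), bcm2835_gpio_write() and bcm2835_gpio_lev() can be built on the standard Linux sysfs GPIO interface (/sys/class/gpio), the same interface discussed elsewhere in this thread, with the same speed limits. This is a sketch under assumptions: the pin has already been exported (for example with echo 132 > /sys/class/gpio/export), the program may write to the sysfs files, and all helper names are hypothetical rather than an existing library API.

```c
#include <stdio.h>

/* Hypothetical helper: build "/sys/class/gpio/gpio<N>/<attr>" into buf. */
int gpio_attr_path(char *buf, size_t len, int gpio, const char *attr)
{
    int n = snprintf(buf, len, "/sys/class/gpio/gpio%d/%s", gpio, attr);
    return (n > 0 && (size_t)n < len) ? 0 : -1;
}

/* Write a string to a sysfs attribute; returns 0 on success. */
static int sysfs_write(const char *path, const char *value)
{
    FILE *f = fopen(path, "w");
    if (f == NULL)
        return -1;
    int ok = fputs(value, f) >= 0;
    return (fclose(f) == 0 && ok) ? 0 : -1;
}

/* Rough stand-in for bcm2835_gpio_fsel(): set direction "in" or "out". */
int gpio_set_direction(int gpio, const char *dir)
{
    char path[64];
    if (gpio_attr_path(path, sizeof path, gpio, "direction") != 0)
        return -1;
    return sysfs_write(path, dir);
}

/* Rough stand-in for bcm2835_gpio_write(): drive high (1) or low (0). */
int gpio_write(int gpio, int level)
{
    char path[64];
    if (gpio_attr_path(path, sizeof path, gpio, "value") != 0)
        return -1;
    return sysfs_write(path, level ? "1" : "0");
}

/* Rough stand-in for bcm2835_gpio_lev(): read the level, or -1 on error. */
int gpio_read(int gpio)
{
    char path[64];
    char c = 0;
    if (gpio_attr_path(path, sizeof path, gpio, "value") != 0)
        return -1;
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;
    int ok = fread(&c, 1, 1, f) == 1;
    fclose(f);
    return ok ? (c == '1') : -1;
}
```

For example, gpio_set_direction(132, "out") followed by gpio_write(132, 1) would drive CSID0 high.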

@infrapro: Can you post a sample of any working code you have using GPIO 132? It appears that you are using libsoc from https://github.com/jackmitch/libsoc.


#6

I’ve tested bit-banging on the CSI port (pin 132). It is much faster than XIO, but still not fast enough for me. Below is a speed comparison I made with a logic analyzer: the maximum frequency through XIO (pin 408) is 4 kHz, and through CSI (pin 132) it is 166 kHz.

[Logic analyzer capture: XIO (pin 408)]

[Logic analyzer capture: CSI (pin 132)]

This is the example of GPIO manipulation through libsoc that I tested, compiled with the -O3 optimization flag:

#include <stdio.h>

#include "libsoc_gpio.h"

#define GPIO_OUT 408 // XIO pin; use 132 (CSID0) for the CSI test

int main()
{
    int ret = -1;

    gpio *gpio_output = libsoc_gpio_request(GPIO_OUT, LS_SHARED);
    if (gpio_output == NULL)
    {
        printf("Failed to open GPIO %d\n", GPIO_OUT);
        goto fail;
    }

    libsoc_gpio_set_direction(gpio_output, OUTPUT);
    if (libsoc_gpio_get_direction(gpio_output) != OUTPUT)
    {
        printf("Failed to set direction to OUTPUT\n");
        goto fail;
    }

    int  i;
    for (i = 0; i < 1000; ++i)
    {
        libsoc_gpio_set_level(gpio_output, HIGH);
        libsoc_gpio_set_level(gpio_output, LOW);
    }

    ret = 0;

    fail:
    if (gpio_output)
    {
        libsoc_gpio_free(gpio_output);
    }

    return ret;
}

Compile (put -lsoc after the source file, otherwise the linker may not resolve the libsoc symbols):
gcc -O3 main.c -o main -lsoc


#7

You can do exactly the same memory-mapped I/O that the bcm2835 functions do, but you need to change the memory address. The base address of the GPIO registers on the A13 is 0x01C20800. I think the register layouts are somewhat similar, so just adjust the bitfield calculations. See the Port Register List and Port Register Description chapters in the A13 User Manual for their layout.
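As a sketch of that bitfield arithmetic, the helpers below compute physical register addresses from the 0x01C20800 base. They assume the register layout used by sunxi-tools’ pio.c (which is quoted later in this thread): each port occupies 0x24 bytes, the configuration registers start at offset 0x00 with 3-bit function fields spaced 4 bits apart (8 pins per register), and the data register sits at offset 0x10. The function names are hypothetical.

```c
#include <stdint.h>

#define PIO_BASE      0x01C20800u  /* A13 PIO base address */
#define PIO_PORT_SIZE 0x24u        /* bytes per port (A, B, C, ...) */

/* Physical address of the configuration register that holds `pin` of
 * `port` (port 0 = A, 1 = B, ...). Each register covers 8 pins. */
uint32_t pio_cfg_addr(unsigned port, unsigned pin)
{
    return PIO_BASE + port * PIO_PORT_SIZE + (pin / 8) * 4;
}

/* Bit position of the pin's 3-bit function field inside that register. */
unsigned pio_cfg_shift(unsigned pin)
{
    return (pin % 8) * 4;
}

/* Physical address of the port's data register (one bit per pin). */
uint32_t pio_data_addr(unsigned port)
{
    return PIO_BASE + port * PIO_PORT_SIZE + 0x10;
}
```

For PE4 (port 4, pin 4), this gives a configuration register at 0x01C20890 with the function field starting at bit 16, and a data register at 0x01C208A0.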

I used the CSI pins to bit-bang avrdude: the effective throughput was 450 bits per second on XIO versus 13,150 bits per second on the CSI pins. I couldn’t get pin 129 (CSI_MCLK/SPI2_CLK) to switch to output, so I had to use the CSID* pins instead.


#8

I looked at the sunxi-tools sources and found GPIO manipulation via memory mapping in the pio.c file. I reused parts of that code, ran a performance test, and was surprised by the speed. With -O3 optimization, my logic analyzer can’t capture the signal even at its top sample rate (24 MHz)! As far as I can tell, the pin toggles faster than the analyzer can resolve; I can only see the waveform if I add a sleep in the loop or compile without optimization.
Below is the example code. Pin 132 corresponds to port PE4, so the port letter ('E') and the pin index (4) are passed to the port access functions:

pio_get(buff, 'E', 4, &pio);
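The correspondence between the sysfs numbers used earlier in the thread and port/pin pairs follows the usual sunxi convention, number = (port - 'A') * 32 + pin, which is consistent with 132 being PE4 (4 * 32 + 4). It applies only to the SoC’s own pins, not to XIO expander numbers such as 408. A small sketch, with hypothetical function names:

```c
/* Sunxi convention: each port letter spans 32 GPIO numbers. */
int sunxi_gpio_number(char port, int pin)
{
    return (port - 'A') * 32 + pin;
}

/* Inverse: split a GPIO number into its port letter and pin index. */
void sunxi_gpio_split(int number, char *port, int *pin)
{
    *port = (char)('A' + number / 32);
    *pin = number % 32;
}
```

With this, the CSID0 to CSID7 range 132 to 139 listed earlier maps to PE4 through PE11.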

Example:

#define _DEFAULT_SOURCE
#define _BSD_SOURCE

#include <endian.h>     /* le32toh()/htole32(); replaces sunxi-tools' endian_compat.h */
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

#define PIO_REG_SIZE 0x228 /*0x300*/
#define PIO_PORT_SIZE 0x24

struct pio_status {
    int mul_sel;
    int pull;
    int drv_level;
    int data;
};

#define PIO_REG_CFG(B, N, I)    ((B) + (N)*0x24 + ((I)<<2) + 0x00)
#define PIO_REG_DLEVEL(B, N, I)    ((B) + (N)*0x24 + ((I)<<2) + 0x14)
#define PIO_REG_PULL(B, N, I)    ((B) + (N)*0x24 + ((I)<<2) + 0x1C)
#define PIO_REG_DATA(B, N)    ((B) + (N)*0x24 + 0x10)
#define PIO_NR_PORTS        9 /* A-I */

/* volatile: every use really reads the hardware register */
#define LE32TOH(X)        le32toh(*((const volatile uint32_t *)(X)))


static int pio_get(const char *buf, char port_name, uint32_t port_num, struct pio_status *pio)
{
    uint32_t port = port_name - 'A';
    uint32_t val;
    uint32_t port_num_func, port_num_pull;
    uint32_t offset_func, offset_pull;

    port_num_func = port_num >> 3;
    offset_func = ((port_num & 0x07) << 2);

    port_num_pull = port_num >> 4;
    offset_pull = ((port_num & 0x0f) << 1);

    /* func */
    val = LE32TOH(PIO_REG_CFG(buf, port, port_num_func));
    pio->mul_sel = (val>>offset_func) & 0x07;

    /* pull */
    val = LE32TOH(PIO_REG_PULL(buf, port, port_num_pull));
    pio->pull = (val>>offset_pull) & 0x03;

    /* dlevel */
    val = LE32TOH(PIO_REG_DLEVEL(buf, port, port_num_pull));
    pio->drv_level = (val>>offset_pull) & 0x03;

    /* i/o data */
    if (pio->mul_sel > 1)
        pio->data = -1;
    else {
        val = LE32TOH(PIO_REG_DATA(buf, port));
        pio->data = (val >> port_num) & 0x01;
    }
    return 1;
}

static int pio_set(char *buf, char port_name, uint32_t port_num, struct pio_status *pio)
{
    uint32_t port = port_name - 'A';
    volatile uint32_t *addr;  /* volatile so the optimizer cannot drop register writes */
    uint32_t val;
    uint32_t port_num_func, port_num_pull;
    uint32_t offset_func, offset_pull;

    port_num_func = port_num >> 3;
    offset_func = ((port_num & 0x07) << 2);

    port_num_pull = port_num >> 4;
    offset_pull = ((port_num & 0x0f) << 1);

    /* func */
    if (pio->mul_sel >= 0) {
        addr = (uint32_t*)PIO_REG_CFG(buf, port, port_num_func);
        val = le32toh(*addr);
        val &= ~(0x07 << offset_func);
        val |=  (pio->mul_sel & 0x07) << offset_func;
        *addr = htole32(val);
    }

    /* pull */
    if (pio->pull >= 0) {
        addr = (uint32_t*)PIO_REG_PULL(buf, port, port_num_pull);
        val = le32toh(*addr);
        val &= ~(0x03 << offset_pull);
        val |=  (pio->pull & 0x03) << offset_pull;
        *addr = htole32(val);
    }

    /* dlevel */
    if (pio->drv_level >= 0) {
        addr = (uint32_t*)PIO_REG_DLEVEL(buf, port, port_num_pull);
        val = le32toh(*addr);
        val &= ~(0x03 << offset_pull);
        val |=  (pio->drv_level & 0x03) << offset_pull;
        *addr = htole32(val);
    }

    /* data */
    if (pio->data >= 0) {
        addr = (uint32_t*)PIO_REG_DATA(buf, port);
        val = le32toh(*addr);
        if (pio->data)
            val |= (0x01 << port_num);
        else
            val &= ~(0x01 << port_num);
        *addr = htole32(val);
    }

    return 1;
}

int main()
{
    char *buff;
    long pagesize = sysconf(_SC_PAGESIZE);
    off_t addr = 0x01c20800 & ~(pagesize - 1);   /* page-aligned base */
    off_t offset = 0x01c20800 & (pagesize - 1);  /* PIO offset within the page */

    int fd = open("/dev/mem", O_RDWR | O_SYNC);
    if (fd == -1) {
        perror("open /dev/mem");
        exit(1);
    }
    buff = mmap(NULL, (0x800 + pagesize - 1) & ~(pagesize - 1), PROT_WRITE | PROT_READ, MAP_SHARED, fd, addr);
    if (buff == MAP_FAILED) {  /* mmap returns MAP_FAILED, not NULL, on error */
        perror("mmap PIO");
        exit(1);
    }
    close(fd);
    buff += offset;

    struct pio_status pio;
    pio_get(buff, 'E', 4, &pio);

    pio.mul_sel = 1;
    pio.data = 0;
    pio.drv_level = 0;
    while(1)
    {
        pio.data = !pio.data;
        pio_set(buff, 'E', 4, &pio);
        //usleep(1);
    }

    return 0;
}

#9

Excellent! I’ll start experimenting with it pretty soon.

I’ve been doing more experiments with bit-banging the CSI GPIO pins through the “/sys/class/gpio” driver, producing a steady square wave with clock_gettime() and busy-waiting. With the CSI pins, I can get a pretty uniform square wave. But from time to time, Linux takes the CPU away for periods on the order of 10 ms. I think the kernel decides that the process has exhausted its time slice and gives other processes a chance to run.

This makes sense since the CPU only has a single core. Linux doesn’t need much, but it needs some.

Basically, a single core Linux system isn’t going to be a reliable source of long-running, bit-banged signalling.
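The clock_gettime() busy-wait technique described above can be sketched as follows. To stay independent of any particular GPIO driver, it toggles one bit of a data register through a volatile pointer (for example a memory-mapped port data register like the one in the /dev/mem listing earlier in the thread); with the sysfs driver, the toggle would instead be a write to /sys/class/gpio/gpioN/value. Deadlines advance by a fixed half-period from the start time, so one late edge does not shift the later ones. Nothing here prevents the preemptions described above; the returned worst lateness just makes them visible. The function name is hypothetical.

```c
#define _POSIX_C_SOURCE 199309L
#include <stdint.h>
#include <time.h>

/* Current CLOCK_MONOTONIC time in nanoseconds. */
static long long now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (long long)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/* Toggle `mask` in *data_reg every half_period_ns for `cycles` full
 * periods by busy-waiting. Returns the worst lateness seen, in ns. */
long long square_wave(volatile uint32_t *data_reg, uint32_t mask,
                      long long half_period_ns, int cycles)
{
    long long deadline = now_ns();
    long long worst_late = 0;

    for (int i = 0; i < 2 * cycles; i++) {
        deadline += half_period_ns;
        long long t;
        while ((t = now_ns()) < deadline)
            ;                            /* spin until the next edge */
        if (t - deadline > worst_late)
            worst_late = t - deadline;   /* e.g. time lost to preemption */
        *data_reg ^= mask;               /* flip the pin */
    }
    return worst_late;
}
```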


#10

@infrapro - Yev, congratulations, you’ve surpassed the speed of light … well, as far as your logic analyzer can tell :scream:

[quote=“fordsfords, post:9, topic:2971”]
Basically, a single core Linux system isn’t going to be a reliable source of long-running, bit-banged signalling.
[/quote]
@fordsfords - Steve, "You’re gonna need a bigger boat … " and it’s called the M/V (Merchant Vessel) Real-Time :sunglasses:


#11

Finally, there is a libsoc fork with an implementation of direct GPIO access. It is not finished yet, but it generally works. An example of how to use it can be found here


#12

If you mean a “Linux Real-Time” kernel, that still won’t help when the important thread (the one doing the time-critical timing) is 100% CPU bound. For any kind of multi-threading to work, high-priority threads must spend at least some time asleep (in the operating system sense). If a high-priority thread is asleep, something needs to wake it up. The normal Unix sleep functions (including nanosleep) are VERY inaccurate at small sleep times since they depend on the periodic Unix clock interrupt, which typically has a 1 ms period.
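One common compromise between sleeping and spinning (a general technique, not something proposed in this thread) is to sleep with clock_nanosleep() until shortly before an absolute deadline, then busy-wait the remainder. The spin margin is a hypothetical tuning parameter: it has to cover the wake-up latency you actually measure on your kernel, and the function name is also hypothetical.

```c
#define _POSIX_C_SOURCE 200112L
#include <errno.h>
#include <time.h>

/* Current CLOCK_MONOTONIC time in nanoseconds. */
static long long now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (long long)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/* Wait until the absolute CLOCK_MONOTONIC time deadline_ns: sleep until
 * spin_margin_ns before it, then busy-wait the rest.
 * Returns how late the wait finished, in nanoseconds. */
long long wait_until(long long deadline_ns, long long spin_margin_ns)
{
    long long wake_ns = deadline_ns - spin_margin_ns;

    if (wake_ns > now_ns()) {
        struct timespec wake = {
            .tv_sec = (time_t)(wake_ns / 1000000000LL),
            .tv_nsec = (long)(wake_ns % 1000000000LL),
        };
        /* Absolute-time sleep: if a signal interrupts it (EINTR),
         * sleep again toward the same wake-up time. */
        while (clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &wake, NULL) == EINTR)
            ;
    }

    long long t;
    while ((t = now_ns()) < deadline_ns)
        ;   /* spin the final stretch for accuracy */
    return t - deadline_ns;
}
```

For a periodic signal, the caller keeps its own deadline, adds the period after each edge, and calls wait_until() before toggling the pin. The CPU is still busy for the margin, so this narrows the trade-off rather than removing it.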

One interesting question which has been asked a few times is whether one of the header pins can be used as an interrupt. This could allow running a MUCH higher rate clock, although there are limits there as well. Server grade processors start to sweat at interrupt rates above 100,000 per second. Another barrier to easy use would be that an interrupt automatically puts Linux into Kernel mode, and kernel programming is a horse of a different color. The transition between kernel and user space represents a large overhead, and is a major reason for the current speed limit of CSI access via the “/sys/class/gpio” driver. The best way to implement high-accuracy, high-rate timing would be to write most of the code in the form of a kernel module and use interrupts. Not being a kernel wonk, it would take me quite a while to figure that out.

Alternatively, the use of a microcontroller is conceptually the same as adding a core to CHIP, and has the advantage that it doesn’t have Linux. A CHIP/microcontroller combo gives you CHIP for Wi-Fi, SSL, Bluetooth, graphics, USB, user interface work, and the microcontroller for busy-looping pulse generation, etc. (Although I must admit that I’ve never programmed a microcontroller either.)


#13

@fordsfords - Steve, who said anything about Linux? There are real real-time OSes Out There that are used by the embedded and industrial controller sectors (Rockwell Automation, Allen-Bradley, National Instruments, etc.) in every domain, such as manufacturing, assembly lines, chemical/biological processing, food processing, vehicle control … even semiconductor fabrication (the controllers control fabbing of the devices used to make the controllers … whoa, how meta, man!).

Of course, we need something that will run on ARM, so our choices are narrowed, and if we want open-source, they’re even more limited, but surprisingly, here is a list of options: uKOS, Atomthreads, BeRTOS, BRTOS, CapROS, ChibiOS/RT, ChronOS, CoActionOS, Contiki, distortos, dnx, eCos, Embox, ERIKA Enterprise, EUROS, FreeOSEK, FreeRTOS, FunkOS, Fusion, FX, ISIX, iRTOS, Lepton, Milos, mipOS, MMLite, MQX, Neutrino, nOS, Nucleus OS, Nut/OS, NuttX, OpenEPOS, OS21, OpenRTOS, picoOS, QP, RIOT, RTAI, RTEMS, RT-Thread, RTX Keil, scm, SDPOS, sil, T-Kernel, TI-RTOS Kernel, Trampoline, TNKernel, TNeo, TUD:OS, Unison, Xenomai, Y@SOS, and uOS (some of these may be based on real-time Linux kernels, so downloader beware).

Also, many only run on certain ARM architectures, mostly v7 and Cortex M3, and many may not be compatible with drivers for some/all of the peripherals that the C.H.I.P. has, so a lot of ferreting through docs and source would be needed. Since they are open-source though, one could theoretically stitch together a Frankenstein (C.H.I.P.enstein?) of the features needed from code that is compatible with the R8, if none of these fulfills all of your needs. It sure beats starting from scratch at the bare-metal level in assembly, unless you’re already an expert or you want to learn to become one, and there’s certainly nothing wrong with that.


#14

@infrapro I’ve been testing this, and I would say it is not that fast… On my 2 GS/s scope there is no square wave. To me this means that -O3 optimization (and also -O2 and -O) is being too clever and not executing the actual code: if you make a change and immediately undo it, the optimizer removes the code. It is not the first time this has happened to me, so be careful with -OX…

Is anyone able to confirm this?
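A likely explanation, offered as a hedged note: if the mapped register is accessed through a plain (non-volatile) pointer, the compiler is allowed to merge or drop repeated stores to it, because nothing in the program reads them back. Declaring the pointer volatile forces one real access per statement. The sketch below assumes data_reg points at the mapped port data register (for example, the PIO_REG_DATA address from the listing above); the function name is hypothetical.

```c
#include <stdint.h>

/* Toggle bit `pin` of a memory-mapped data register `count` times.
 * volatile makes every read and write reach the register, so the
 * optimizer cannot collapse each HIGH/LOW pair into nothing. */
void toggle_pin(volatile uint32_t *data_reg, unsigned pin, unsigned long count)
{
    const uint32_t mask = UINT32_C(1) << pin;

    for (unsigned long i = 0; i < count; i++) {
        *data_reg |= mask;    /* drive high */
        *data_reg &= ~mask;   /* drive low */
    }
}
```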


#15

Also consider locking the process memory (mlockall) and setting the Linux scheduling policy to SCHED_FIFO.
I have seen this done on the Raspberry Pi when accessing the registers directly through memory mapping, something like this.
I have also seen such projects link against the real-time library (-lrt).

C++
#include <iostream>
#include <sched.h>
#include <sys/mman.h>
using namespace std;

int main() {
    // Set the maximum possible priority and switch from the default Linux
    // time-sharing policy (SCHED_OTHER) to FIFO fixed-priority scheduling.
    // This typically needs root (or CAP_SYS_NICE).
    struct sched_param sp;
    sp.sched_priority = sched_get_priority_max(SCHED_FIFO);
    if (sched_setscheduler(0, SCHED_FIFO, &sp) < 0) { // change scheduling
        cout << "Failed to switch to SCHED_FIFO" << endl;
        return 1;
    }
    // Lock the process's memory into RAM, preventing page swapping.
    if (mlockall(MCL_CURRENT | MCL_FUTURE) < 0) { // lock current & future pages
        cout << "Failed to lock the memory." << endl;
        return 1;
    }
    // ...
    munlockall(); // unlock the process memory
    return 0;
}

C
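#include <sched.h>
#include <string.h>
#include <sys/mman.h>

struct sched_param sp;
memset(&sp, 0, sizeof(sp));
sp.sched_priority = sched_get_priority_max(SCHED_FIFO);
sched_setscheduler(0, SCHED_FIFO, &sp);  /* real-time FIFO scheduling */
mlockall(MCL_CURRENT | MCL_FUTURE);      /* keep current and future pages in RAM */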