Artificial intelligence did not appear overnight, but it is only now that it is on everyone's lips and generating a broad social debate. In this article we will leave that debate aside and look at one of the many advantages AI can bring to your company: call transcription.
Transcribing calls has always meant manually writing down what was said in a call, either as it happened or from a recording. Why is it important? Simply put, because it has always served to increase the satisfaction of stakeholders, especially customers. Turning the content and key points of a telephone conversation into a written document is not the same as trusting them to memory. When customers see that their data, problems, suggestions or opinions are recorded and followed up by the company, their satisfaction increases. And the better the customer experience is rated, the greater the business success.
What has artificial intelligence achieved? It has made the process easier for us. Now we only have to worry about talking on the phone; the rest is in its hands. Advances in artificial intelligence, specifically in natural language processing (NLP) and large language models (LLMs), have given rise to speech recognition systems that automate the conversion of speech into text with high accuracy. These automatic transcription systems also enable a long list of related applications that add value to your company.
Broadly speaking, there are three AI-based alternatives for automatic transcription: building your own solution with an open source model (such as OpenAI's Whisper ASR), using APIs (such as those from OpenAI or Google), or hiring a service from a company that offers these solutions directly integrated into the company's communication tools (such as Fonvirtual). In this article we compare the three options and study the pros and cons of each so that you can assess which one best suits your company's needs.
Call transcription through an open source model
Some open source models stand out for their versatility and accuracy in speech recognition. Their advanced language models and sequence-to-sequence learning open up an attractive range of possibilities for companies, since they make it possible to build a wide variety of voice applications, such as transcription services, virtual assistants or speech analysis, which in turn creates new ways for users to interact with technology.
By its nature, open source gives developers the freedom to modify the system to meet specific needs or requirements. For example, such models can support operational optimization by identifying areas for improvement and analyzing resource management, so that results are maximized and the resources used are minimized. They can unlock valuable information that, in most cases, has been overlooked. They can also help predict new user needs, market trends or problems that should be addressed.
However, they also have limitations. Companies should consider whether they have enough internal expertise to develop an application on top of an open source model from scratch. It is very easy to fall into the trap of not being able to scale as expected, because these models are innovative but also demanding: they require constant updates and significant allocations of resources.
How much does an open source model cost?
Is hosting and using an open source model really always cheaper than the other options? It depends on your company's specific case but, in general, for companies with a complex network that need to transcribe a high volume of content, hosting an open source model ends up being more expensive. You have to consider the total cost of ownership required to host, optimize and maintain the model at scale. It is true that running an open source model is quite affordable in itself, but be careful: much more goes into building your own in-house solution on top of open source software.
Factors to take into account in the total cost:
- Hosting: The cost of running the CPUs, which process the input audio, apply the NLP algorithms and generate the text output, and the GPUs, which accelerate those algorithms, is significant because these systems are expensive and scarce.
- Human capital: Adequate hosting requires at least two senior software developers, whose annual salaries can exceed 80,000 euros. It also requires a data scientist and a project manager, so their salaries have to be taken into consideration.
- Network: The higher the data transfer speed required by the speech-to-text technology, the higher the network costs.
- Authentication: Verifying the identity of devices and/or users may add a cost on top of the software or hardware, such as paying for security certificates or other mechanisms that guarantee authentication.
- Security: In line with the above, you should also budget for intrusion detection and prevention systems, firewalls, antivirus or other security measures.
- Maintenance: An open source model requires software updates over time and technical assistance.
- Certification: Only if you want to obtain official certification of your own speech recognition and text conversion solutions.
Whether this price is worth it will depend on your use case and your project needs.
Call transcription from APIs
APIs are cloud services that offer developers pre-built tools and interfaces to convert spoken words (audio or video) into written text. To process audio input and generate text output, they employ a combination of traditional and deep learning models, such as recurrent neural networks (RNNs), convolutional neural networks (CNNs) or transformer-based models. In other words, APIs use machine learning algorithms, trained on large amounts of data, to transcribe spoken language. Unlike the previous option, this one does not require a large internal infrastructure. An API can be accessed from any device with internet access, and its development and updates fall on the provider, not on the user.
However, APIs do have certain limitations in terms of file size and latency. For example, OpenAI's audio-to-text transcription API works with audio files that must be smaller than 25 MB. Longer recordings have to be compressed or split first, which costs time and performance. If a recording exceeds the limit, the transcription may be truncated, and any summary built on it will be incomplete. When the audio has to be split, context may be lost where the files are cut. In addition, the latency or response time of the API means that in many cases these systems are not suitable for the real-time transcription that certain business services need, such as translation or real-time conversation analysis.
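The splitting problem described above can be planned before any audio is cut. The following is a minimal sketch in Python, not part of any provider's SDK: the function name plan_chunks, the 5% safety margin and the 2-second overlap are illustrative assumptions, and the calculation assumes a constant bitrate so that file size is proportional to duration. Overlapping consecutive chunks slightly is one common way to reduce the loss of context at the cut points.

```python
MAX_UPLOAD_BYTES = 25 * 1024 * 1024  # the 25 MB limit mentioned in the text


def plan_chunks(duration_s, size_bytes, max_bytes=MAX_UPLOAD_BYTES, overlap_s=2.0):
    """Return (start, end) times in seconds so each chunk stays under max_bytes.

    Assumption: constant bitrate, so bytes are proportional to seconds.
    plan_chunks is a hypothetical helper, not a real API call.
    """
    if duration_s <= 0 or size_bytes <= 0:
        raise ValueError("duration and size must be positive")
    if size_bytes <= max_bytes:
        # The whole recording fits in a single upload.
        return [(0.0, float(duration_s))]

    bytes_per_second = size_bytes / duration_s
    # Assumed 5% safety margin for container overhead.
    max_chunk_s = 0.95 * max_bytes / bytes_per_second
    if max_chunk_s <= overlap_s:
        raise ValueError("overlap is too large for this bitrate")

    step = max_chunk_s - overlap_s  # each chunk starts overlap_s before the previous end
    chunks = []
    start = 0.0
    while start < duration_s:
        end = min(start + max_chunk_s, duration_s)
        chunks.append((start, end))
        if end >= duration_s:
            break
        start += step
    return chunks


if __name__ == "__main__":
    # A one-hour call stored as a 100 MB file.
    for start, end in plan_chunks(3600, 100 * 1024 * 1024):
        print(f"{start:8.1f}s -> {end:8.1f}s")
```

The resulting time ranges would then be passed to whatever audio tool does the actual cutting. Each chunk still has to be transcribed separately and the texts merged, which is where the extra time mentioned above comes from.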
How much does it cost to transcribe with an API?
With APIs we can forget about the cost of installing, developing and maintaining the system. Here the cost lies in rates based on the duration of the audio files. Prices are very competitive, ranging from half a euro cent per minute for the OpenAI API to two cents per minute for Google Cloud. Attractive as that looks at first, we have to consider the total duration of what we usually transcribe. It is a very good option for companies with a low to medium volume of audio to transcribe. For those that need to transcribe video conferences or long telephone conversations, however, the cost rises considerably.
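To see how volume changes the picture, the arithmetic can be sketched in a few lines of Python. The per-minute rates are the ones quoted above and will change over time, so check current price lists. The function names and the fixed-cost figure are illustrative assumptions: the fixed cost counts only the two senior developer salaries of 80,000 euros mentioned earlier, which deliberately underestimates a real in-house budget.

```python
# Per-minute rates in euros as quoted in this article (they may be out of date).
RATE_EUR_PER_MIN = {"openai": 0.005, "google_cloud": 0.02}


def monthly_api_cost(minutes_per_month, rate_eur_per_min):
    """Cost in euros of transcribing a given number of minutes in a month."""
    return minutes_per_month * rate_eur_per_min


def break_even_minutes_per_year(annual_fixed_cost_eur, rate_eur_per_min):
    """Minutes per year at which API spending equals a fixed in-house cost."""
    return annual_fixed_cost_eur / rate_eur_per_min


if __name__ == "__main__":
    minutes = 10_000  # hypothetical monthly volume
    for provider, rate in RATE_EUR_PER_MIN.items():
        print(provider, round(monthly_api_cost(minutes, rate), 2), "EUR/month")

    # Assumed fixed cost: two senior developers at 80,000 EUR each, nothing else.
    fixed = 2 * 80_000
    for provider, rate in RATE_EUR_PER_MIN.items():
        print(provider, round(break_even_minutes_per_year(fixed, rate)), "min/year")
```

At 10,000 minutes a month, the API bill is roughly 50 euros at the lower rate and 200 euros at the higher one. Under the same assumptions, salaries alone would only be matched by API spending at several million minutes per year, which is why the per-minute model suits low and medium volumes so well.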
Call transcription as a service offered by a company specialized in communications with artificial intelligence
Hiring a service from a company specialized in AI-based communications solutions, such as Fonvirtual, lets you, as with APIs, forget about having the infrastructure, human capital or budget to install, develop and keep the system updated. Unlike APIs, however, it offers more advanced functions, such as creating summaries, identifying emotions, personalizing interactions with clients, and real-time transcription with simultaneous translation into other languages.
Because all the company's communications pass through the provider's systems, it has access to the conversations and can transcribe telephone calls or video conferences in real time and display them in the different interfaces without the company having to handle or send audio files. In addition, since the virtual PBX can be integrated with other business management tools, such as CRM and project management software, transcripts can be sent to those systems in real time to be exploited. The service can also identify suspicious conversation patterns to protect against fraud and, among other things, supports compliance with regulations such as HIPAA or GDPR, which protect the privacy and confidentiality of data.
Real-time transcription is ideal for companies that not only have a large volume of audio, but also need to obtain its content and keywords quickly, such as companies with a high level of interaction with stakeholders.
In Fonvirtual's case, call transcription is one of the many solutions offered on its platform. This platform, which can be linked to the company number, is an internal and external communication tool for voice, chat and video that can be used from anywhere in the world, from any device and without a large investment. Another of its solutions, for example, is taking credit card payments over the phone with full security.
How much does it cost to hire the Fonvirtual service?
Hiring a service like the one offered by Fonvirtual means that there are never any surprises when paying. It is a recurring payment for a service that adapts closely to your company's needs, with no variable costs.
Comparing the models
To begin the comparison, note that open source models offer a solution that the company has to install, configure and customize before it is operational. APIs also require some development, though it is very simple. By contrast, the call transcription service integrated into the virtual switchboard requires no development, personnel or infrastructure. It is a cloud service that builds the speech recognition capabilities of artificial intelligence into the provider's applications and platforms, so companies can forget about the complexities of speech recognition algorithms and infrastructure configuration without giving up the benefits of artificial intelligence.
In terms of productivity, the transcription option integrated into the virtual switchboard achieves the best balance, offering customizable options and superior performance thanks to the optimization gained from serving a large number of clients. Replicating optimized models (including LLMs and generative AI models) with open source models is challenging.
Regarding start-up time, creating a complete AI solution for speech recognition from scratch with open source models can take around a year. With an API or the service offered by a specialized company, you can get value from AI-powered features from the first day of implementation.
On the other hand, by running and maintaining open source models, organizations do not depend on a third-party service, so they keep full control. This is especially relevant when a provider's servers are offline. However, the lifecycle is much shorter with open source because you don't get updates, so be prepared to update software and hardware every couple of years.
Transcription with open source models, as with APIs, requires preparing the audio: to transcribe phone calls we have to record them, download the recordings and send them.
Although open source models have no file size limitation, they do not always achieve the response speed that is needed. APIs, as mentioned previously, usually limit file size, so transcriptions may be incomplete or lose quality if the audio has had to be divided into several files. With the option integrated into the switchboard, you can access the transcriptions in real time via the web without having to worry about anything.
Most companies use call transcripts for much the same purposes: knowing the customer, detecting attitudes, labeling conversations, detecting business opportunities or risks, taking notes, summarizing conversations, staff training, etc. For this reason, call transcription integrated into the switchboard also offers a turnkey solution for exploiting this information, which avoids having to analyze the transcriptions afterwards.
In short, the cost of a call transcription solution built on an open source model is much higher than that of the other options, but the result can be very powerful and fully customized. Solutions through APIs are economically attractive because their cost is variable; they require some development by the company and allow a lot of customization, but they are limited by aspects such as latency, which is key if we need the solution to work in real time. Finally, a call transcription solution integrated into the switchboard, like Fonvirtual's, has very competitive rates and, although it does not allow as much customization, it includes information exploitation systems that satisfy the needs of most companies.