enlightenbio  Blog

Basepair Provides an Intuitive, Easy-to-Use NGS Data Analysis Platform Which Does Not Require Coding Skills

This month’s Company Spotlight provides a closer look at Basepair, a developer and supplier of a suite of automated NGS analysis solutions for research, clinical, and pharma teams. The company’s mission is to provide an intuitive, easy-to-use platform that requires no coding skills and therefore allows any bench scientist to run their own NGS analysis pipelines without the help of a bioinformatician. Basepair users can select from over 30 popular NGS analysis pipelines using cloud-based software that also uses some AI algorithms. Results are made available via an interactive report with intuitive visualization tools.

I had the opportunity to visit Amit Sinha, founder and Chief Executive Officer of Basepair, at his office in New York on Madison Avenue to discuss the company and its unique strengths in the market.

EB: Tell us more about Basepair – your solution allows users to analyze their own RNA, DNA, ChIP, or ATAC data with efficiency and ease. Why did you start Basepair and develop its products?

Amit Sinha:  The conventional wisdom for many researchers is that sequence data analyses are complex in nature and challenging for biologists to run themselves. Working as a researcher at Harvard Medical School, I initially helped analyze NGS data. It quickly became apparent that the workflow was frustrating not only for me, having to run the same time-intensive analysis pipelines over and over, but also for my colleagues, who had to put up with long turnaround times until they received their results. I was the bottleneck. Fast forward to 2012: I had joined Memorial Sloan Kettering Cancer Center as an investigator, where the same issue emerged repeatedly. During that time, several products that address sequence analysis were launched, such as DNAnexus, Seven Bridges Genomics, and Galaxy, but in my eyes, none offered the right solution for the challenges I observed. Furthermore, free tools were unintuitive and difficult for non-bioinformaticians to navigate, while the paid products required expensive subscriptions that not all labs could justify. That is when I decided to step away from academia and build a solution that could be used by non-experts. This eventually became Basepair.

I really wanted to create a quality product that lowered the barrier to entry so anybody with sequence data, whether they had 10 samples or 100, could go online and get started quickly. As an additional consideration: not everyone has a $100,000 budget or dedicated bioinformatics support. Maybe a researcher has just 15 samples that she or he needs to analyze. Such cases are truly numerous, and a fast, affordable solution was missing. As a result, we decided to focus on the product in a more technological sense: taking the burden off the researcher and making it easier for them to get insights from their data. This is how Basepair got started, and we had a lot of success early on.

For the first year or two it was just referral-based growth. We did not even have a sales team, yet we ended up going international very quickly. There clearly was a market for a product that was not enterprise-focused and did not require large licensing fees before use. When you think of Basepair, think of products like Slack or Dropbox: products that are a joy to use, with a very low barrier to entry. One can get going in minutes.

EB:  You mentioned you have built this product to address a certain need in the sector: in your words, what is it exactly you are trying to address? How would you explain this to a potential customer?

AS:  Our strength is that we think backwards. We take a target user segment and try to understand what these users know and what they want to do. If you talk to technical people, they talk about the cloud or the details of the architecture, etc. And Basepair is built on the cloud with a solid architecture under the hood, but this is not the first thing most of our users ask about. They want fast processing and data storage, but they don’t really care about the technical specifications. They care about the data and the results. They care about the fact that there is a simple website that allows them to upload their data, analyze their data, and quickly get the results they are looking for. The technical details are less of a center-stage component. Making sense of the data is the most important goal.

EB:  How big is your team now?

AS:  Initially, we focused on building a product that was easy to use without much hand-holding and with a simple onboarding process. That was then; now we are a team of 10, which includes developers, bioinformatics scientists, marketing, and sales. We have also raised a seed round and now have a commercial presence in the market. We are currently expanding the tech team.

EB: Who are you targeting with the software products you offer? Who is benefitting from using your analysis software solutions? 

AS:  Our product is suitable for two segments, the researcher and the clinical segment.

  • The researcher segment:
    • The value proposition is very clear: they upload the data, they run the analysis (all self-serve), and they get nice, publication-ready figures within a couple of hours. Their goal is to publish new findings.
    • We are now also catering to commercial organizations, specifically kit manufacturers that provide an NGS-based kit solution to the research audience, and which requires a unique type of analysis. We are currently working with the R&D departments of several kit providers to develop custom, kit-specific analysis pipelines and make them available via Basepair. When a researcher purchases the kit, it will come bundled with Basepair’s data analytics. This really simplifies and streamlines the NGS sequencing and analysis process. Unfortunately, I can’t go into any specifics right now.
  • The clinical segment:
    • We are also moving into the diagnostics sector, but the details are not yet public. We are working with companies to build clinical pipelines that process raw sequence data into analyzed and interpreted reports in a scalable manner – I am talking a few hundred thousand samples a year or more.

Researchers are not currently able to build their own analysis pipelines in Basepair. It is on our roadmap, but for now it is not something we are focusing on. What we have done is build a set of standardized pipelines for specialized needs. The benefit of Basepair is its full-stack solution: cloud-based infrastructure, automated pipelines, and visualization capabilities for reporting.

Interactive Basepair genome browser – appropriate files are automatically loaded for visualization.

Having said this, if you are, for example, a bioinformatician at Merck and you have this perfectly built pipeline that you want the entire company to use, then you can put your pipeline into Basepair. In other words, you can build it externally and we then help import it via a service. If you have the individual tools in Docker containers, then we will build the pipeline for you, and it will run your Docker-based tools exactly the way you designed them. We charge for this as a service based on the complexity of the workflow. You can then share the final pipeline with internal users and everybody can run it without the need for bioinformatician support.

Basepair’s single cell RNA-seq report includes interactive clustering plots.

EB:  You mentioned that your product is suitable for assay kit solution providers, as well as pharmaceutical researchers. Do you also cater to clinicians?

AS:  We recently started our foray into clinical diagnostics with one of the biggest providers in the space. Basepair will be analyzing over a few hundred thousand samples a year. While our core platform scales well, we have had to beef up our security and auditing processes. With clinical data, there is a data residency requirement, e.g., labs in Germany would not want their data stored in the USA. We’ve expanded our platform to support these specific requirements. We do not, however, have a clinical knowledgebase, and therefore we rely on other public or private data sources.

EB:  Can I, as an end user, share my tools with the entire user base in Basepair? Is there a mechanism for that?

AS:  Yes, indeed! One example that highlights this capability comes from the Ebert lab at Harvard:

  • One lab member developed a custom CRISPR validation pipeline.
  • We took his scripts and set up the pipeline in Basepair.
  • Now over 20 people in his lab and 50 additional people in many other labs are using this specific pipeline.

EB:  The sequence data analysis space is very crowded with many players and different solutions. What is the biggest strength that Basepair brings to the table?

AS:  It seems like a crowded space, but really when you look more carefully, the various players are all offering very specific solutions for very specific tasks. Hence, we do not see ourselves competing with the likes of Sentieon, Bluebee, or DNAnexus. In nine out of ten cases, our biggest competitors are in-house bioinformaticians. Very often we hear from potential customers: “We have a guy who does the analysis for us.”

The industry today is still very much a services-driven industry. If people require bioinformatics experience, they hire five bioinformaticians. That is the current mindset we observe. I believe this is slowly changing as people recognize that those bioinformaticians are a bottleneck and that this is not always the right approach. It is not that we plan to make bioinformaticians redundant, but I think it is meaningless for bioinformaticians to run the same pipelines over and over again. Just because you are a biologist and you cannot run a Linux job on a computing cluster does not mean that the best solution is to hire a bioinformatician to do it.

We are developing a new approach, a new way of analyzing data, a new way of integrating data, and we are doing all this in addition to providing a scalable workflow system. The best way to think about this is the scenario where an in-house bioinformatician, like in the Harvard lab mentioned above, develops a pipeline and offers that pipeline via an easy-to-use solution so that anyone with pipeline access can run it without consulting a bioinformatician first. In this specific example, the pipeline has by now been run many times, on some days over 300 times. That is how we see the industry evolving. I strongly believe a mindset change needs to happen, especially for individuals who could never have imagined that they could do these types of analyses themselves. In the past, they were told: “Hey, these tasks are too complex, don’t try to analyze this yourself. You will make a mistake, you will mess it up. Let the bioinformatician do it for you.” We are now saying: “No, you can do it yourself, and you will not mess it up.” By using Basepair, these users are more likely than bioinformaticians to notice red flags. Our report is very comprehensive and includes all the QC statistics as well. That is sort of what we want to do: to create a comprehensive solution around NGS.

Our strength is the simplicity of the software, simplicity of the workflow, and simplicity of pricing for people to get their foot in the door. Once their use case evolves, then we have more sophisticated options for them. All in all, it comes down to lowering the barrier to entry for an NGS data analysis product. Most of the other products on the market are more complex in nature and not meant for the individual researcher who cannot code and/or is not a computational researcher or bioinformatician.

EB:  How are you charging, per sample or per analysis?

AS:  We charge a fixed price per sample, and it all goes back to simplifying pricing for our users. Internally, we break the cost down based on compute cost, analysis cost, and download cost to determine the prices we offer our users. Some users may run a few more analyses, while some may run fewer, but establishing fixed pricing per sample really simplifies things for our users and allows them to set and stay within a budget. As such, our pricing model is all-inclusive, and users can store their data for one year and even come back and rerun an analysis.

We support a range of data, from small re-sequencing efforts to whole genome sequencing. There are many different data types, and the fixed price per sample is established based on the size of the sequencing files. Whole genome sequencing would be the highest in cost, while small re-sequencing would be the lowest. The way we describe it to our customers: the average analysis cost is about one third of the sequencing cost. This is a model most customers can easily relate to. On top of this, we also provide volume discounts as well as a student plan. Students likely do not need to store data for a year (the student plan supports only one month of data storage), and they represent the most cost-sensitive customer segment. The first six samples are free.
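Editor's illustration: the pricing logic described above can be sketched as a quick budget estimate. This is only a rough sketch under stated assumptions, not Basepair's actual price list. The function name and parameters are hypothetical, the one-third ratio is the rule of thumb quoted in the interview rather than a real per-sample price, and the sketch assumes the six free samples apply to every account, which the interview does not state explicitly. Volume discounts and file-size tiers are ignored.

```python
def estimate_analysis_cost(num_samples, sequencing_cost_per_sample,
                           free_samples=6, analysis_ratio=1 / 3):
    """Rough analysis budget for a project (hypothetical helper).

    Assumptions, not confirmed by Basepair:
    - analysis costs about one third of sequencing, per sample
    - the first `free_samples` samples are not billed
    - no volume discount or student plan is applied
    """
    if num_samples < 0 or sequencing_cost_per_sample < 0:
        raise ValueError("sample count and cost must be non-negative")
    billable_samples = max(0, num_samples - free_samples)
    price_per_sample = sequencing_cost_per_sample * analysis_ratio
    return billable_samples * price_per_sample


# Example: 15 samples, with an assumed sequencing cost of 300 per sample.
# Nine samples are billable under the assumptions above.
print(round(estimate_analysis_cost(15, 300.0), 2))
```

Because the price is fixed per sample, the estimate depends only on the sample count and data type, not on how many times an analysis is rerun.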

For enterprise customers, we offer more flexible pricing depending on their preference. Storage is a big factor in the price — some only need the data for a month, while some prefer one-year access, and others plan for the next ten years. For sophisticated users, we use their existing cloud account and orchestrate the resources, and they directly pay the costs to the cloud provider.

EB:  Basepair works in the cloud. Which cloud are you working with?

AS:  As of now, we are predominantly using Amazon (AWS) as our main cloud provider, with very little additional activity in Microsoft Azure. For enterprise organizations, we even offer local Amazon cloud support, e.g. AWS Singapore.

EB:  Data is King! Making sense of data requires the incorporation of reference data, because one cannot really make sense of any data if analyzed in isolation. For that reason, an analyst ideally has access to all kinds of data. Do you, for example, provide access to various reference data for your researchers to compare it to?

AS:  We make it super easy for our users to compare data to any public data available, for example in the NCBI GEO (Gene Expression Omnibus) or SRA (Sequence Read Archive) databases. Basepair includes functionality that allows a user to quickly import any data from those NCBI databases for analysis. As of now, we charge the same price when they upload those samples for analysis. The future plan, however, is to import them all up front, though we still have to figure out a way to charge a flat fee for each user. It is on our roadmap to make this quicker and cheaper for our users because it will be shared consumption.

Regarding clinical data analysis: right now that data is in silos. For example, Memorial Sloan Kettering is doing its own thing, Vanderbilt is doing its own thing. Ideally, this data will ultimately be combined, but there are huge hurdles such as data and patient privacy that prevent this from happening. It is one of the biggest issues the industry is facing right now. I don’t see a clear solution that would address all the concerns of the individual hospitals and institutions.

EB:  You mentioned that you come from the field of machine learning. Is there something Basepair is doing that will include machine learning?

AS:  On the diagnostic side, we are working towards a solution that would allow multiple labs to run their analyses on our platform. This would allow us to collect the data and then build machine learning models based upon those data. That is something I have been personally very interested in, namely connecting genomic data to outcomes for prediction purposes. It is on our roadmap, yes. For now, our goal is to grow the number of users on the platform. As more customers use Basepair, more data will be available, which will put us in a good position to develop and offer such a solution.

EB:  What do you see are the biggest challenges the genomics data field is currently facing and why? How are you affected by these challenges, and how do you address them?

AS:  I think the biggest challenge is finding the right tool and understanding how to use that tool for NGS data analysis. The sequencing itself is pretty straightforward. The challenge is associated with understanding how to interrogate the data, how to organize the data, and how to make sense of the data. It comes down to asking the right questions, and that is where Basepair comes in: to help the user get to the answer. Researchers often think they can get the answer from talking to a bioinformatician, but often it all gets muddled up because a bioinformatician talks about what they can do for the researcher without actually explaining how the analysis part really works. There are scientists out there who are doing very cool experiments, and we try to close the gap between what their project needs and what the data can give them. We help them frame the questions better. There is so much more they could get from the data, but they don’t recognize it. In the end, the scientists themselves should be the drivers, and we should have the tools for them to move quickly in the right direction.

We often encounter scientists and PIs who have data and want someone to quickly find the answer they are looking for and create a figure so they can stick it into a publication. It is easy to use our software, but I believe scientists should still understand the details. It should not be a complete black box approach. NGS is not just one thing — it is an entire platform, and one can do many things with this platform: DNA sequencing, RNA sequencing, CRISPR validation, and much more. It is good to understand how the data has been processed, starting with the raw data, how the data is analyzed, what the results mean, and therefore what that ultimate figure represents.

Brigitte Ganter
